The Internet is the most complex system ever created in human history. Therefore, its
dynamics and traffic unsurprisingly take on a rich variety of complex dynamics,
self-organization, and other phenomena that have been researched for years. This paper is
a review of the complex dynamics of Internet traffic. Departing from normal treatises,
we will take a view from both the network engineering and physics perspectives, showing
the strengths and weaknesses as well as insights of both. In addition, many less covered
phenomena such as traffic oscillations, large-scale effects of worm traffic, and comparisons
of the Internet and biological models will be covered.

1. Introduction

In the last ten years, research on networks, especially the Internet, has exploded
amongst physicists. This follows at least twenty years of research on the same and
similar questions among computer scientists and network engineering researchers.
Interest in the physics community began with several seminal papers on small-world
and scale-free networks (54; 177; 13; 138; 120). Since then, this research has
progressed at a rapid pace, borrowing well-developed tools from statistical mechanics
and thermodynamics, spectral graph theory, and percolation theory, among others,
to enhance the understanding of fields such as Internet topology and social network
analysis previously only researched by the network engineering and sociology
communities. This new interdisciplinary work has been very fruitful. However, there
are gaps only more recently being addressed. In particular, while the contribution
of physicists to the understanding of topology and community structure in networks
is substantial, physicists (and their counterparts in sociology or network
engineering) are often unaware at best or dismissive at worst of work on similar
questions in other fields. In addition, theoretical understanding and predictions
for network dynamics comparable to those now being reached for topology remain
elusive and are still in the earliest stages. There have been many good papers
written on dynamics by physicists, but they have yet to formulate results with the
same generality or power as the results on topology or to make successful predictions
validated with real-world data.

Just as the ease of accessibility and measurement of Internet topology allowed
the field of networks to grow, Internet traffic dynamics provide a similar
opportunity. Serious research on the macroscopic nature of Internet traffic can be
traced almost to its inception. However, only in about the last 15 years has the
field come of age and begun to provide truly deep insights into how communication
over the largest technological edifice in human history operates. Within this time
the terms "self-similarity", "multifractal", and "critical phenomena" have emerged
to try to refashion our ideas about the Internet and how it behaves. This paper
could just be a review of the work by physicists and a few engineers on network
traffic. However, my familiarity with the research in the field has convinced me
that for a full understanding of the state of the art research in Internet traffic
dynamics, separating the views of engineers and physicists would make any analysis
incomplete and inadequate. An area for improvement in this research on network
traffic is increased collaboration and cross-citation of works from other fields.
Though in the study of networks there are notable exceptions, in general, physicists
and engineers studying the Internet conduct their own research projects, using
field specific methodologies, and publish in field specific journals with little
cross-citation of relevant results from the other disciplines. Indeed, one can see from
the average paper in physics journals such as Physical Review E or Physica A or
from engineering journals such as the IEEE or ACM series of journals that many
interrelated problems are being studied from totally different perspectives. This
has sometimes caused tension, especially between physicists and network engineers,
about the utility, details, or real-world validation of physics based theories and the
lack of generality and generally applicable network principles in the engineering
perspective.

In figure 1 I have tried to give a full diagram and summary of how these different
world views operate. Anyone trying to see which view is completely "right" or
"wrong" is missing the point that the Internet is in some part everything that
both sides describe it as. It is an example of an engineering system that is dependent
on the nature of its protocols and other workings to function. It is also a large-scale
self-organized system not far removed from those that physicists have studied in
physical systems for years. However, the research rarely reflects a full synthesis
of both views. Also, both perspectives are more correct on some points than on
others. For researchers interested in this area, or even those who feel themselves
to be seasoned veterans, one of the first papers I recommend is (91), which is a
short position paper released by the 2006 CAIDA Workshop on Internet Topology.
Attended by eminent physicists and engineers, this workshop clearly spelled out the
promise and peril of multidisciplinary research on Internet topology and traffic.

In line with the increasing focus on network dynamics, and the reality that many
such research projects involve the Internet, this review paper is meant to
familiarize mainly physicists and engineers with the major results of each other and how
they interrelate. For physicists, hopefully it will provide more exact information on
the workings of packet switching systems in the Internet in order to allow us to
better test our predictions against reality, build more realistic models and simulations,
validate models and theories with real data such as traffic traces, and contribute
to the study of network dynamics through a more complete understanding of the
dynamics of the Internet. Network engineers can see the issues raised by
statistical mechanics approaches towards network features such as congestion and
realize that there can possibly be large-scale phenomena quite independent of
detailed technical specifications. Since the Internet probably has the largest readily
accessible and easily understandable archive of network traffic dynamics, it will
likely play a huge role in empirically validating theoretical ideas and simulations of
dynamics in networks.

This paper is organized to provide not only a review of current research
in the field but also a basic introduction to the workings of the Internet for
newcomers to the field. First, I will review the basic ideas of Internet traffic including
packets, definitions of flows and throughput, and the basic protocols. While this
may seem common knowledge, much work in the field can only be understood
with the correct definitions and knowledge of the Internet basics. Therefore,
this is provided to prevent confusion and perhaps inform on less discussed topics.
Next, we will discuss the evolution and composition of Internet traffic as far as usage 
and protocols are concerned and study the basic dynamics of packet flow including 
packet size distribution and Internet flow characteristics. The meat of the paper 
involves a detailed discussion of the self-similar nature of Internet traffic and how it 
is defined and measured, the detailed workings and dynamics of the TCP transport 
protocol, as well as the large body of work by physicists on phase transition and 
critical phenomena models on packet switching networks. These techniques will be 
reviewed and their possible promises and current pitfalls addressed. Finally, several 
interesting and related phenomena present in Internet traffic such as oscillations will 
be covered. Each idea is given a firm grounding and a thorough introduction but 
there will be no pretense that I can completely delve into all research on any of these 
ideas in one paper. Self-similar traffic alone has already inspired several volumes on 
even its most esoteric aspects. However, it is hoped this paper will allow someone 
with a reasonably technical background and minimal familiarity with the subject 
and research to quickly grasp the main themes and results that have emerged from 
the research. Even for those that consider themselves experts, there may be small 
insights or details that have been poorly covered in most treatments and may add 
to their knowledge of the subject. 

2. Packets, OSI Network Layers, and Key Terminology 

2.1. Packets and OSI 

In 1969, the Internet (then ARPANET) was first established as a distributed packet
communications network that would reliably operate if some of its nodes were
destroyed in an enemy attack as well as facilitate communication between computer
centers in academia. Though the Internet has changed greatly up until today, its
packet switching mechanism and flexibility remain its key aspects. The packet is
the core unit of all Internet traffic. A packet is a discrete bundle of data which is
transmitted over the Internet containing a source and destination address, routing
instructions, data description, a checksum, and data payload. Packet handling and
traffic management are governed by a complex set of rules and algorithms
collectively defined as a protocol.

Fig. 1. Network engineers and physicists often have diverging viewpoints on similar Internet traffic
phenomena due largely to their backgrounds and training as well as the fundamental questions
they ask. Here is a summary of those perspectives. On the top and bottom are the background
knowledge and viewpoints of each side and in the middle are the problems they typically tackle
and how they ask the questions.

Layer        | Number | Description                                                       | Example Protocols
-------------|--------|-------------------------------------------------------------------|---------------------
Application  | 7      | Network applications such as terminal emulation and file transfer | HTTP, DNS, SMTP
Presentation | 6      | Formatting of data and encryption                                 | SSL
Session      | 5      | Establishment and maintenance of sessions                         | TCP sessions
Transport    | 4      | Provision of reliable and unreliable end-to-end delivery          | TCP, UDP
Network      | 3      | Packet delivery, including routing                                | IP
Data Link    | 2      | Framing of units of information and error checking                | Ethernet, ATM
Physical     | 1      | Transmission of bits on the physical hardware                     | 10BASE-T, SONET, DSL

Table 1. Breakdown of 7 layer OSI model. Descriptions taken from l|13'

Different protocols are responsible for handling different aspects of traffic. 
Though this may seem trivial, protocols heavily affect the nature of traffic and 
models of traffic which may be completely valid for one protocol can be completely 
invalid for another. Also, protocols are used in different applications or tasks and 
these should inform any analysis of Internet traffic or predictive models describing 
its behavior. 

In addition, there are levels of tasks handled by certain protocols and not others.
These are traditionally broken out into seven layers by a model known as the Open
Systems Interconnection (OSI) model. The seven layers are shown and described
with examples in table 1. For analysis of packet data, the application, transport,
network, and data link layers are typically the most relevant.

The higher layers (higher number) always initiate a lower level protocol. For 
example, for e-mail using the application protocol SMTP, SMTP starts a TCP 
connection which itself uses IP packets to deliver data. 

Even within the same layers though, protocols can function much differently. By
far the most well-known and widely used suite is the transport/network protocol
combination TCP/IP. Transmission Control Protocol (TCP), which manages
sessions between two interconnected computers, is a connection based protocol, which
means it has various means of checking and guaranteeing delivery of all packets.
This is why it is widely used to transmit web pages using Hypertext Transfer
Protocol (HTTP), email with Simple Mail Transfer Protocol (SMTP), and other widely
used applications. TCP's connectionless cousin is User Datagram Protocol (UDP).
UDP sends packets without bothering to confirm a connection or receipt of
packets. This can make it unreliable for delivery but much faster and more useful for
real-time applications like voice over IP (VoIP). TCP will be covered in more detail
and its differences elaborated later in the paper. These differences cause TCP to
react to feedback in its traffic patterns and adjust its throughput based on these
considerations.

Fig. 2. Structure of a packet in this paper. Proportions based on a 50 byte UDP packet payload.
Numbers are size of headers or payload in bytes. A is the Link Layer (e.g. Ethernet) header which
contains MAC address source and destination and payload type, B is the Internet Protocol (IP)
header, C is the transport layer TCP/UDP protocol header, D is the data payload and E is the
Link Layer (e.g. Ethernet) CRC checksum hash to prevent accidental corruption of the frame.



2.2. Packet structure 

Packets have two main parts: a data payload which contains specific data being 
transmitted and overhead which contains instructions about packet destination, 
routing, etc. Each level and protocol has a different amount of overhead as shown
in table 2. Overhead usually has both a fixed and variable portion. However, for
most transport and network layer protocols the fixed portion can usually be consid- 
ered the size of the entire header. When dealing with the total size of packets and 
measuring throughput, one must be careful to specify whether or not the packet size 
includes overhead. Also, at the data link layer, there is a maximum frame size of 
1500 bytes in most systems (minus data link layer overhead). The effect of packet 
size and packet size distribution will also be covered in more detail later in the 
paper. 



Protocol | Header Size
---------|----------------------------------------------
IP       | 20 bytes for IPv4, 40 bytes for IPv6
TCP      | Normally 20 bytes, can be up to 60 bytes
UDP      | 8 bytes
Ethernet | 14 bytes for header and 4 bytes for checksum

Table 2. Packet header sizes for prominent protocols
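
As a rough illustration of how the header sizes in table 2 add up, the following Python
sketch computes the on-the-wire size and overhead fraction of a single frame. The function
name frame_size and the choice of an Ethernet stack with minimum-size headers (no options)
are assumptions made for this example only; the 50 byte UDP payload matches the case
drawn in Fig. 2.

```python
# Header sizes in bytes, taken from table 2 (minimum sizes, no options).
HEADER_BYTES = {
    "ethernet": 14 + 4,  # 14 byte header plus 4 byte CRC checksum
    "ipv4": 20,
    "ipv6": 40,
    "tcp": 20,
    "udp": 8,
}


def frame_size(payload_bytes, transport="udp", network="ipv4"):
    """Return (total frame bytes, overhead fraction) for one Ethernet frame."""
    overhead = (HEADER_BYTES["ethernet"]
                + HEADER_BYTES[network]
                + HEADER_BYTES[transport])
    total = payload_bytes + overhead
    return total, overhead / total


if __name__ == "__main__":
    total, frac = frame_size(50, transport="udp")
    print(f"50 byte UDP payload: {total} bytes on the wire, {frac:.0%} overhead")
    total, frac = frame_size(1460, transport="tcp")
    print(f"1460 byte TCP payload: {total} bytes on the wire, {frac:.1%} overhead")
```

For small payloads the headers make up a large share of the bytes on the wire, which is
why it matters whether a quoted packet size includes overhead.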



2.3. Packet traffic characteristics 

Larger than individual packets is the packet flow which can be statistically described
using many important measures. Probably the most widely known and important 
are bandwidth, throughput, goodput, packet flow rate, flow, latency, packet loss, 
and Round Trip Time (RTT). 

Bandwidth - Bandwidth is the maximum possible throughput over a link. Band- 
width, being an ideal, is almost never achieved under normal conditions but provides 
a convenient benchmark to compare the capacity of data links. 

Throughput - Throughput is the rate of packet transmission over a network link,
usually in megabits per second (Mbps) or kilobits per second (Kbps). It is the most
widely recognized measure of network data speed and essential in understanding
the performance of data traffic.

Goodput - Goodput is the measure of throughput excluding packet overhead. 
When analyzing data, one must be careful to ascertain whether traffic data is 
throughput or goodput. If it is goodput, the total throughput is actually larger 
because you have to incorporate the average packet overhead in the amount of data 
transferred. However, for both throughput and goodput, the packet flow rate is the
same. 

Packet Flow Rate - The rate of packet flow over a network link. This differs 
from throughput or goodput in only measuring the number of discrete packets that 
travel over the link, regardless of their size. Throughput and packet flow rate are 
related by the following equation 

$$T = s\lambda \qquad (1)$$

where $s$ is the average packet size including overhead, $T$ is the average
throughput, and $\lambda$ is the average packet flow rate.

Session - A TCP communication dialog set up between two computers by a three-step
exchange: the initiating party first delivers a SYN packet, the other party answers
with a SYN+ACK packet, and finally the initiating party sends an ACK.

Flow - This flow must be carefully distinguished from the packet flow rate
mentioned above. A flow in Internet traffic is defined several ways but in general is
a connection between a source and destination which is continuously transmitting
data. Usually, this means a connection based protocol such as TCP where a
connection is made and data continuously transmitted until the connection times out
or a standard inter-packet arrival time is exceeded in the "packet train" (85; [SS).
Sometimes traffic is measured in terms of the number of flows instead of bytes.
The distribution of flow sizes and their properties will be covered later in the paper.

Packet Loss - This is the percentage of all packets lost in transit. It is usually 
measured as the percent difference between packets transmitted in a packet flow and 
packets received on the other end from the same packet flow. This affects all traffic
and is usually caused by link congestion. It has a large effect on TCP throughput. 

Round Trip Time (RTT) - The statistical average time it takes a packet to travel 
from a source to a destination and back. It is the most common measure of latency 
on computer networks. It is closely related to throughput and along with packet 
loss often used as a measure of link congestion. 
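
The relationships among several of the measures defined above can be made concrete with
a short Python sketch. It applies equation 1 (throughput equals average packet size times
packet flow rate), subtracts per-packet overhead to get goodput, and computes packet loss
as a percentage. The function names and the sample numbers are hypothetical values chosen
for illustration, not measurements from any study cited here.

```python
def throughput_bps(avg_packet_bytes, packets_per_second):
    """Equation 1, T = s * lambda, converted from bytes/s to bits/s."""
    return avg_packet_bytes * 8 * packets_per_second


def goodput_bps(avg_packet_bytes, avg_overhead_bytes, packets_per_second):
    """Throughput with the average per-packet overhead removed."""
    return (avg_packet_bytes - avg_overhead_bytes) * 8 * packets_per_second


def packet_loss_percent(packets_sent, packets_received):
    """Percentage of transmitted packets that never arrived."""
    if packets_sent <= 0:
        raise ValueError("packets_sent must be positive")
    return 100.0 * (packets_sent - packets_received) / packets_sent


if __name__ == "__main__":
    s = 500        # assumed average packet size in bytes, including overhead
    lam = 10_000   # assumed packet flow rate in packets per second
    overhead = 58  # Ethernet + IPv4 + TCP minimum headers from table 2
    print(throughput_bps(s, lam) / 1e6, "Mbps throughput")
    print(goodput_bps(s, overhead, lam) / 1e6, "Mbps goodput")
    print(packet_loss_percent(10_000, 9_950), "% packet loss")
```

Note that the packet flow rate is the same input to both the throughput and the goodput
calculation; only the bytes counted per packet differ.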

2.4. Protocol traffic breakdown 

Overall, TCP dominates all traffic with about 95% or more of total bytes, 85-95%
of all packets, and 75-85% of all flows using TCP as the transport protocol
(168; 31; 93). UDP comes second, representing about 5% or less of traffic, with its main
function being sending DNS requests and communications. TCP application traffic
has generally evolved over time in three main eras characterized by the dominant
types of traffic influenced by available applications and access speeds. In the Text
Era (1969-1994) most TCP traffic was driven by email, file transfers, and USENET
newsgroups. In 1989, Caceres (24) at UC Berkeley characterized Internet traffic as
being 80% TCP and 20% UDP by packets and 90% TCP and 10% UDP by bytes.
TCP traffic bytes were split roughly evenly between SMTP for email and FTP
for file transfer while UDP was mostly DNS. An updated study by Caceres and
collaborators in 1991 (25) monitored traffic at several universities, again finding
similar results. Once again at UC Berkeley 83% of packets were TCP, 16% were
UDP, and about 1% were ICMP. UDP traffic was predominantly for DNS at 63%
of its packets. TCP traffic in terms of packets was 28% telnet, 16% rlogin (Unix
host login utility), 12% FTP, 12% SMTP, 12% NNTP (USENET), with the balance
shared among other protocols. FTP was the largest protocol in bytes at 36% of all
bytes. These dominant application level protocols were confirmed by Claffy and
Polyzos as well (32).

Next was the Graphics or Hyperlink Era (1994-early 2000). After CERN made
the World Wide Web free for any use in 1993, the graphics based web grew rapidly.
In 1994, Paxson (140) reported that although FTP, SMTP, and
NNTP still held sway in Internet traffic, HTTP was by far the fastest growing protocol,
growing 300-fold in traffic measured by connections in only two years and already vying to be
the 2nd most popular TCP application level protocol. By 1995, WWW traffic had
become the largest application level protocol with 21% of traffic by packets
compared to 14% for FTP, 8% for NNTP, and 6% for SMTP. By 1997, Thompson,
Miller, & Wilder (168) could report that HTTP dominated TCP traffic, 95% of
all Internet traffic bytes at this point, with 75% of the overall bytes, up to 70% of
the overall packets, and 75% of the overall flows during daytime hours. Its closest
competitor, SMTP, was reduced to only 5% of packets and 5% of all bytes. The
Internet was now a popular, mainly web and graphics based medium.

The current era is the Multimedia Era (early 2000-present). In this period,
sharing of multimedia through P2P file sharing applications and streaming audio
or video began to rival the web for dominance of Internet traffic. A Sprint study on
an IP backbone in early 2000 (66) reports that P2P was already rivaling the web
in terms of bytes transferred, with P2P at times accounting for 80% of all traffic.
Streaming also accounted for as much as 26% of all traffic. The web was
still competitive, however, sometimes accounting for 90% of all traffic. By 2004,
however, Fomenkov et al. ([6J) could report that WWW traffic had clearly peaked in late
1999/early 2000 and P2P had dominated traffic growth ever since. A recent April
2008 traffic trace study shows the Web and P2P sharing 34% and 33% of total
TCP/IP bytes respectively. However, P2P only accounts for about 3% of all flows
compared to the 40% of all flows dominated by the web, showing P2P flows are
generally larger and more likely to be "elephant flows". Another earlier study by
the same team (jSlf) gave similar results with Web and P2P (normal and encrypted)
consisting of 41% and 38% of bytes and 56% and 4% of flows respectively.



2.5. Topology 

As mentioned before, topology is currently the most studied feature of the Internet
and other computer networks by physicists. Due to the wide range and depth of
research being done in this field, this paper will not present even a cursory review of its
main ideas and results. Dealing with Internet traffic, however, it is important to
balance the knowledgeable networking engineering perspective with the more abstract
methods used by physicists. For the now widely familiar methods of physicists, the
author recommends several outstanding review papers (J5l: ll27tl42l : ll8l : ll57tll5d : ll39l ).
However, anyone measuring or analyzing network topology must read
network engineering contributions and rebuttals to many of the now standard scale-free
network techniques and theories. This is usually poorly covered or not at all in the
physics community. In particular, the author recommends (182; 45; 107; 108; 7),
which tackle the power-law distribution and "robust yet fragile" nature of the
Internet that physicists have theorized. Also in (|92| m8 H109 !) the issues of data integrity
in measuring the Internet topology are described, including the errors introduced
by several common methods such as traceroute and BGP router tables. These are essential
reading for anyone looking at this area. Finally, as a guide before claiming
anything as a power law, the author urges these two excellent articles on tackling this
determination in a rigorous fashion (34; 115; 117).



3. Packet Sizes 

3.1. Distribution of packet sizes 

Internet traffic, with its various protocols and traffic types, has many widely varying 
packet sizes. However, there is an upper limit to packet size and this is almost always 
determined at the data link layer. Various data link communications schemes, such 
as Ethernet or ATM, impose an upper bound on the size of transmitted packets 
through the hardware or operating system settings. This upper bound packet size 
is often designated the Maximum Transmission Unit (MTU) at the link layer. 

In Ethernet, the current MTU on most systems is 1500 bytes. Packets at the
data link layer are often termed frames but the idea is the same. This 1500 bytes
includes the payloads and headers of all higher layer protocols (such as IP and TCP)
but does not include the Ethernet frame header and footer. Several studies on packet
size distributions have shown that packet size in general follows a bimodal or trimodal
distribution with most packets being small (500 bytes or less) (90; 168; 43; 31; 93).
In addition, the distribution of packet sizes is not a smooth long-tailed distribution
in that some packet sizes can predominate due to system defaults.

For example, (90; 168) describe peaks in the frequency distribution
of packet sizes. In traces of data over a day or longer on a data link, they explain
many reasons for the small packet sizes. First, for TCP systems there is a protocol
option for "MTU discovery" that tries to find the MTU of the network in order to
make packets as large as possible. If MTU discovery isn't implemented, TCP often
defaults to an MTU of 552 or 576 bytes. Also, nearly half of the packets are 40-44
bytes in length. These packets are used by TCP in control communications such as
SYN or ACK traffic to maintain the connection between the source and destination
systems. Above 576 bytes, (90) finds that the cumulative share of packets increases
linearly up to 1500 bytes, showing that packet sizes in the intermediate region are
relatively equally distributed. In general, according to (168; 31; 93), about 50% of
the packets are 40-44 bytes, 20% are 552 or 576 bytes, and 15% are 1500 bytes.
Table 3 shows the distribution of packet sizes from a traffic trace, which fits well
with these studies except for the absence of a strong peak in the 552 or 576 byte range.



Packet Size Range | ALL | TCP | UDP
------------------|-----|-----|----
0-19              | 0%  | 0%  | 0%
20-39             | 2%  | 0%  | 0%
40-79             | 59% | 69% | 19%
80-159            | 7%  | 2%  | 23%
160-319           | 3%  | 1%  | 15%
320-639           | 7%  | 3%  | 34%
640-1279          | 3%  | 3%  | 6%
1280-2559         | 18% | 22% | 4%

Table 3. Packet size distribution of a capture of 1 million packets in a 100 second trace from the
MAWI {Til) traffic trace archive from Samplepoint-B on July 22, 2005.
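
A distribution like the one in table 3 can be produced from a list of packet sizes with a
few lines of Python. The sketch below is only an illustration of the binning; it is not the
tool used to build table 3, and the function name size_distribution and the sample sizes
are assumptions made for the example.

```python
from bisect import bisect_right
from collections import Counter

# Lower edges (in bytes) of the size ranges used in table 3.
BIN_EDGES = [0, 20, 40, 80, 160, 320, 640, 1280]
BIN_LABELS = ["0-19", "20-39", "40-79", "80-159",
              "160-319", "320-639", "640-1279", "1280-2559"]


def size_distribution(packet_sizes):
    """Return {range label: percent of packets} for integer sizes in bytes."""
    if not packet_sizes:
        raise ValueError("need at least one packet")
    counts = Counter()
    for size in packet_sizes:
        if size not in range(2560):
            raise ValueError(f"packet size {size} outside the table 3 ranges")
        # bisect_right finds the first edge above size; step back one bin.
        counts[BIN_LABELS[bisect_right(BIN_EDGES, size) - 1]] += 1
    total = len(packet_sizes)
    return {label: 100.0 * counts[label] / total for label in BIN_LABELS}


if __name__ == "__main__":
    # Synthetic sample: ACK-sized, default-MTU-sized and full-MTU packets.
    sample = [40] * 5 + [576] * 2 + [1500] * 3
    for label, pct in size_distribution(sample).items():
        print(f"{label:>10}: {pct:.0f}%")
```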



Kushida (93) is one of the only papers that looks at the packet size distribution
of UDP separately, though it clarifies that since 98.2% of the traffic measured in
the paper is TCP, the UDP contribution to the overall IP and Internet traffic packet
size distribution should be considered negligible. Using a different measurement for
packet size distribution, which looks at the ratio of (packet size x number of packets)
to the total traffic measured, Kushida finds a series of peaks between 75 and
81 bytes and another large peak at 740 bytes. None of these peaks is substantial,
however, and no size of packet reaches even 10% of the total. Since UDP has no
connection based features like those of TCP, the reason for these peaks is not inherent
in the protocol itself. UDP is mainly used for the Domain Name System (DNS) and
Simple Network Management Protocol (SNMP; a network monitoring protocol),
and applications related to these functions drive the size of the UDP packets.

Finally, there is often an asymmetry in packet size for both directions in a flow.
For example, if a web page is being served to a PC, the PC will be receiving large
TCP packets with HTTP (WWW) data while it will only be sending comparatively
smaller packets as data requests back. Packet size also has diurnal variations where
it can be larger during daytime hours. Also, in an international link, (90) showed
that the average packet size on both directions of the link oscillated out of phase
by about 11 hours (2.9 radians).
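
The two figures quoted for the phase offset are consistent with each other if one assumes
that the oscillation has a 24 hour period $P$. The diurnal context suggests this period,
but it is an assumption here rather than something stated in the text:

```latex
\Delta\phi = 2\pi \, \frac{\Delta t}{P}
           = 2\pi \times \frac{11\ \text{h}}{24\ \text{h}}
           \approx 2.88\ \text{rad}
           \approx 2.9\ \text{rad}
```

Under that assumption the offset is close to $\pi$, so the two directions of the link are
nearly in antiphase.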

Does packet size or MTU matter? Absolutely. In fact, many network engineers
realize that average packet size and MTU are critical factors in determining the overall
maximum throughput in a network. Recalling equation 1, for a fixed throughput,
decreasing the packet size increases the packet flow rate. Many believe
that the key throttle in computer network throughput is its stated bandwidth.
In fact, bandwidth is rarely the bottleneck on network performance.
Computer network hardware typically has a maximum packet flow rate it can
effectively handle; beyond this rate, packets begin forming queues in the hardware buffer and
congestion reduces throughput. Smith, in (154), showed how on a normal Ethernet
link between two computers, the maximum throughput across varying packet sizes
exhibited a transcritical bifurcation.
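
The interplay between stated bandwidth and a hardware packet rate limit can be sketched
with a deliberately simple saturation model. This is not the model analysed by Smith; it
only assumes that a device forwards at most max_pps packets per second and that the link
carries at most bandwidth_bps bits per second. Both are hypothetical parameters chosen
for illustration, and queueing and congestion effects are ignored.

```python
def max_throughput_bps(packet_bytes, bandwidth_bps, max_pps):
    """Upper bound on throughput for a given packet size.

    Applies equation 1 (T = s * lambda) with lambda capped at max_pps,
    then caps the result at the link bandwidth.
    """
    rate_limited = packet_bytes * 8 * max_pps
    return min(bandwidth_bps, rate_limited)


def crossover_packet_bytes(bandwidth_bps, max_pps):
    """Packet size above which bandwidth, not packet rate, is the limit."""
    return bandwidth_bps / (8 * max_pps)


if __name__ == "__main__":
    bandwidth = 100e6  # assumed 100 Mbps link
    max_pps = 20_000   # assumed forwarding limit in packets per second
    for size in (64, 576, 1500):
        mbps = max_throughput_bps(size, bandwidth, max_pps) / 1e6
        print(f"{size:5d} byte packets: at most {mbps:.1f} Mbps")
    print("crossover at", crossover_packet_bytes(bandwidth, max_pps), "bytes")
```

With these assumed numbers, small packets are limited by the packet rate well below the
stated bandwidth, and only packets larger than the crossover size can fill the link.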

In fact, one problem currently plaguing next generation high speed networks is
outdated, smaller MTUs on the systems of their users. Therefore, in order to take
advantage of the increasing bandwidth capabilities of the Internet, there is a
concerted push in some corners to raise the typical MTU above the normal Ethernet
1500 bytes, up to 9000 bytes where possible, to allow more rapid communication.
There will need to be larger studies on network hardware such as routers to un- 
derstand how the MTU completely affects throughput and whether the bifurcating 
behavior is present for saturated links in the Internet at large. 

4. Flow Size Structure and Distribution 

4.1. Definition and nature of flows 

As mentioned in the definition of an Internet flow, a flow is defined as continuous
communication between a source and destination system. Flows are typically
described by one of two definitions: an identifiable clustering of packets arriving at a
link or by identifying characteristics such as the source and destination addresses
along with an identifying label such as a TCP session ID or an IPv6 flow label.
For the first definition, the most widely used definition was given by Jain and
Routhier in 1986 (85) while studying data on a token ring network at MIT. While also
noting that the interpacket arrival rate follows neither a Poisson nor a compound Poisson
distribution, they defined individual flows as "packet trains", where a packet train is
defined as a sequence of packets whose interarrival times are all less than a chosen
maximum interarrival gap, usually determined by system software and hardware
configurations. If a packet is received after a longer interval than the maximum
gap, it is considered part of a new flow. This brings up one important characteristic
of Internet flows. Though they obviously have a time average, they are extremely
bursty and inhomogeneous compared to most other types of flows studied in physics.
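
The packet train definition translates almost directly into code. The following Python
sketch groups a sorted list of arrival times into trains. The function name, the treatment
of a gap exactly equal to the maximum (it stays in the same train), and the sample
timestamps are assumptions made for this illustration.

```python
def split_into_trains(arrival_times, max_gap):
    """Group sorted packet arrival times (seconds) into packet trains.

    A packet joins the current train when it arrives no more than max_gap
    seconds after the previous packet; a longer gap starts a new train.
    """
    trains = []
    for t in arrival_times:
        if trains and t - trains[-1][-1] <= max_gap:
            trains[-1].append(t)
        else:
            trains.append([t])
    return trains


if __name__ == "__main__":
    times = [0.0, 0.01, 0.02, 0.5, 0.51, 2.0]
    for i, train in enumerate(split_into_trains(times, max_gap=0.1)):
        print(f"train {i}: {len(train)} packets starting at t={train[0]}")
```

The result depends entirely on the chosen maximum gap, which is why the bursty nature of
the traffic makes the number of flows sensitive to this parameter.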
For the second method, the first and still likely most widely used way of
identifying flows via address or label uses TCP packets. TCP flows start with a
SYN packet and end with a FIN packet. Therefore, matching SYN and FIN packets
with source and destination IP addresses and session ID in the TCP headers is
often used to define flows. Another elaborate definition was presented by Claffy
et al. (33), who represent a flow as active if the interpacket time is less than a
maximum value and distinguish flows as groups of packets identified by aspects
including source/destination pairs, unidirectional nature (flows in only one direction),
protocols used, and other factors that may distinguish the packet destinations.
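
A key-plus-timeout definition of this kind can be sketched in Python as follows. This is
an illustrative simplification rather than the procedure of any cited paper: the Packet
record, the five-field key, and the 64 second timeout used in the example are all
assumptions, and real traces carry many more fields.

```python
from collections import namedtuple

# Hypothetical minimal packet record used only for this example.
Packet = namedtuple("Packet", "time src dst proto sport dport size")


def assemble_flows(packets, timeout):
    """Assemble unidirectional flows from a time-ordered packet list.

    Packets sharing the same (src, dst, proto, sport, dport) key belong to
    one flow while consecutive packets are at most `timeout` seconds apart;
    a longer silence starts a new flow under the same key.
    """
    active = {}  # key -> index in `flows` of the most recent flow for that key
    flows = []
    for p in packets:
        key = (p.src, p.dst, p.proto, p.sport, p.dport)
        idx = active.get(key)
        if idx is None or p.time - flows[idx]["end"] > timeout:
            flows.append({"key": key, "start": p.time, "end": p.time,
                          "packets": 0, "bytes": 0})
            idx = len(flows) - 1
            active[key] = idx
        flow = flows[idx]
        flow["end"] = p.time
        flow["packets"] += 1
        flow["bytes"] += p.size
    return flows


if __name__ == "__main__":
    trace = [
        Packet(0.0, "10.0.0.1", "10.0.0.2", "tcp", 1234, 80, 60),
        Packet(0.1, "10.0.0.2", "10.0.0.1", "tcp", 80, 1234, 1500),
        Packet(0.2, "10.0.0.1", "10.0.0.2", "tcp", 1234, 80, 40),
        Packet(90.0, "10.0.0.1", "10.0.0.2", "tcp", 1234, 80, 60),
    ]
    for f in assemble_flows(trace, timeout=64):
        print(f["key"], f["packets"], "packets,", f["bytes"], "bytes")
```

Because the key is directional, the request and response halves of one TCP connection
count as two separate flows in this sketch.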

The new next generation Internet Protocol, IPv6, though not yet implemented 
widely beyond a now defunct test network called 6Bone, has been designed with 
a part of its header overhead reserved for a "flow label". This flow label would 
allow the traffic source to provide a unique identifier that would clearly distinguish 
IP traffic flows. Besides improvements in routing and traffic management, this will
allow more accurate research as IPv6 is implemented throughout the Internet. 



4.2. Distributions of flow characteristics 

Early in the paper, it was mentioned that long-tail behavior is present in
Internet traffic to the same extent as it is in the topology. Flows are no
exception, and several quantities used to describe flows have long-tail
distributions. In particular, the distributions of flow size (data
transferred), flow duration, and flow data rate have all been found to exhibit
long tails. These flows have been given certain names throughout the
literature, which are summarized by Lan and Heidemann [100]. Flow sizes are
divided into two classes, "elephants" and "mice", where elephants are a small
fraction of all flows measured over a certain time but account for a large
share of the bytes transferred, while the many other flows (the mice) account
for proportionally smaller components of the overall traffic. Elephant flows
have been described in detail in several papers [100; 53; 55; 99; 121; 132].
In particular, a paper by Mori et al. describes a traffic trace where elephant
flows are only 4.7% of all flows measured but 41% of all traffic during the
period. Barthelemy et al. [16] give a related result studying routers on the
French Renater scientific network. They find that a small number of routers (a
so-called 'spanning network') transmit the vast majority of data on the network
while the contribution of the other routers is exponentially smaller.

Elephant flows, though agreed upon in principle, have been defined differently
in many papers. Estan [53] defined an elephant flow as a flow that accounts for
at least 1% of total traffic in a time period. Papagiannaki [132] uses flow
duration to classify elephant flows. Lan and Heidemann [100; 99] use a
statistical definition where a flow is considered an elephant flow if the
amount of data it transmits is at least equal to the mean plus three standard
deviations of flow size during a period. This is 152 kB in their paper. This
final definition implicitly assumes that the scaling exponent among flow sizes,
$\alpha$, is at least 2, since the variance of the distribution is infinite if
$\alpha < 2$.
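The mean-plus-three-standard-deviations rule can be illustrated with a short
Python sketch. The function name `classify_elephants` and its parameter `k` are
hypothetical, and the sample statistics are always finite even when the
underlying distribution has infinite variance, which is exactly the caveat
raised above.

```python
import statistics
from typing import List, Sequence, Tuple


def classify_elephants(flow_sizes: Sequence[float], k: float = 3.0) -> Tuple[float, List[bool]]:
    """Flag flows whose size is at least mean + k * (standard deviation).

    Returns the threshold and one boolean per flow (True = elephant).
    Uses the population standard deviation of the observed sizes.
    """
    mean = statistics.fmean(flow_sizes)
    std = statistics.pstdev(flow_sizes)
    threshold = mean + k * std
    return threshold, [size >= threshold for size in flow_sizes]
```

With 99 flows of 1 kB and a single flow of 1000 kB, only the large flow exceeds
the threshold. For heavy-tailed data the threshold itself depends strongly on
the few largest flows, so it should be treated as trace-specific.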

In figure 3 the author has used data from the WIDE MAWI traffic trace archive
[113], which measured the daily traffic over a T-1 line between Japan and the
Western US, to show the proportion of all traffic that the top 10 flows
represented over time from 2001-2007. The upward tick in mid-2006 reflects the
upgrading of the data link speed from 100 Mbps to 1 Gbps. The percentage of all
traffic captured by the top 10 flows declines over time as the number of
overall flows per day increases and the top 10 occupy a declining share of the
number of flows.




Fig. 3. Percent of data in all flows occupied by the top 10 flows over time. From the WIDE MAWI
traffic trace archive [113] using data from Samplepoint-B from 12/31/2000 - 5/31/2007. Median
daily flows total about 350,000.



Research from Mori et al. [121] also gives evidence that elephant flows not only
occupy disproportionate amounts of traffic but are also more likely than mice
to be responsible for congestion in links.

The duration of flows has been classified with similar zoological flair. Most
flows have a relatively short duration while a small number of flows have a
comparatively very long duration. Brownlee and Claffy [22] analyzed duration
among Internet streams, which are individual IP sessions, as opposed to the
one-way sequences of packets that typify flows. About 45% of streams were of a
very short duration, less than 2 seconds, and were termed "dragonflies". Short
streams were defined as having a duration from 2 s to 15 minutes and made up
another 53% of all flows. "Tortoises" were flows with a duration greater than
15 minutes and accounted for 1-2% of all streams but 50% of all bytes
transferred. The dragonfly/tortoise definition is simplified and extended to
flows in [100], where a dragonfly is a flow whose duration is less than the
mean flow duration plus three standard deviations, which is 12 minutes in
their paper. They find that 70% of all Internet flows last less than 10
seconds.

Lan and Heidemann [100] also introduce a new measure of flow, "cheetahs" and
"snails", to characterize the distribution of throughput in flows. Cheetahs are
flows with an average throughput greater than the mean plus three standard
deviations. Their dividing throughput is 101 kB/s in the paper. According to
their measurements, about 80% of Internet flows have a throughput of less than
10 kB/s.

These different types of measurements on flows are obviously not independent
and in fact are heavily correlated in several ways. Cheetahs tend to be high in
throughput but small in size and short in duration. Zhang (|l6l ) previously
showed a correlation between flow size and rate, and Lan and Heidemann [100]
confirm this, showing that 95% of cheetah flows are dragonflies with a duration
of less than 1 second. 70% of cheetah flows are also smaller than 10 kB.
Elephant flows tend to be large in size and duration but low in throughput.
Only 30% of elephant flows in [100] are faster than 10 kB/s and 5% are faster
than 100 kB/s. 50% of elephant flows lasted longer than 2 minutes and 20% of
elephant flows lasted at least 15 minutes.

Different flow types are also dominated by different types of traffic. Elephants 
are mostly web and P2P traffic while Tortoises are mostly DNS. Cheetahs have by 
far mostly web and DNS traffic. 

The granularity and nuance in the characteristics of flows is an interesting the- 
oretical and practical challenge for those studying Internet dynamics. But it still 
gets even better as our next section on the self-similarity of traffic demonstrates. 



Category    Large-size    Long-lived    Fast    Bursty
Elephant    Y             Y             N       N
Tortoise    N             Y             N       N
Cheetah     N             N             Y       Y

Table 4. Classification and description of flows from Lan and Heidemann [100]



5. Packet Arrival Times - Self-Similarity, Long Range 
Dependence, and Multifractals 

5.1. Self-similar traffic and long range dependence

One of the most widely researched and discussed characteristics of Internet
data traffic among both the computer science and physics communities is the
self-similar nature of Internet packet arrival times. Interestingly enough, the
trajectory of research on this begins with a shattering of simplistic
preconceptions about network traffic similar to that of Barabasi and Albert
regarding Internet topology. Just as pre-Barabasi/Albert theory assumed that
all telecommunications networks, including the Internet, were random graphs,
early research in Internet traffic regarded packet arrival times as based on a
Poisson (Erlang-1) or Erlang-k distribution similar to that in telephone
switching and call center traffic [184]. The first crack in this picture was a
paper by Leland and Wilson [104], which showed packet interarrival times that
exhibited diurnal fluctuations and did not seem to adhere to a Poisson
distribution. The second paper, by Leland, Wilson, Taqqu, and Willinger [105],
thoroughly and convincingly debunked the theory of Poisson arrival of packets
in Internet traffic and, using rigorous statistics, showed that Internet
traffic had self-similar characteristics and correlations over long time scales
(long-range dependence or LRD). Like degree distributions in scale-free
topologies, the packet arrivals per unit time exhibited long-tail
distributions, where large bursts of traffic were not isolated and extremely
rare statistical coincidences but par for the course over all time scales. The
studies were based on four captures of data over four years. The traffic
traces, taken at the former AT&T Bellcore research facility, varied from 20 to
48 hours in length and recorded the timestamps of hundreds of millions of
packets.
The authors are also the first to describe Internet traffic as having a fractal
character. This research has since been confirmed in a torrent of papers which
are too numerous to describe. Paxson and Floyd [142] confirmed the failure of
Poisson modeling in long-tail traffic behavior in Wide Area Network (WAN) data
in several protocols including TCP, FTP, and Telnet. Crovella and Bestavros
[38] demonstrated long-tail distributions in WWW traffic including packet
interarrival times, file download size distributions, file download
transmission time distributions, and URL request interarrival times. Other
papers have essentially confirmed in most cases that long-range dependence is a
key feature of Internet packet traffic.

5.2. Measuring self-similarity and long range dependence 

There are several good review articles detailing the mathematical techniques
used to investigate self-similar processes in network traffic
(134i : ISTt Il82t l52l : Il46l : y; llSlt 184[ ). Here we will cover the most
prevalent and important ones.

The simplest definition of self-similarity assumes that a continuous time
process $X(t)$, $t > 0$, satisfies the following relation when time is scaled
by a factor $c_1$:

$$X(t) = c_1^{-H} X(c_1 t) \tag{2}$$

where H is the Hurst exponent and takes a value between 0 and 1 for
self-similar processes. For a self-similar process that exhibits long-range
dependence, H is between 1/2 and 1. This definition, like most others for
self-similarity, implicitly assumes a stationary process.

The most accepted and widely used definitions are termed first-order and
second-order similarity. First order similarity is based on the autocorrelation
of the traffic trace. Assume a traffic trace is defined as a stationary
stochastic process X with a set of values at time steps t:

$$X = \{X_t : t = 0, 1, 2, \ldots\} \tag{3}$$

and the autocorrelation function $\rho(k)$ is defined as

$$\rho(k) = \frac{E[(X_t - \mu)(X_{t+k} - \mu)]}{\sigma^2} \tag{4}$$

where $\mu$ is the mean and $\sigma^2$ is the variance of the traffic. The
self-similar behavior is manifested in that the autocorrelation function does
not decay exponentially with time, as it does for a short range dependent time
series, but rather exhibits a power law behavior

$$\rho(k) \sim c_2 k^{-\beta}, \qquad 0 < \beta < 1 \tag{5}$$

where $c_2$ is a positive constant and the approximation symbol indicates that
this is the asymptotic behavior of the system as $k \to \infty$. Fitting a
linear regression to an autocorrelation or autocovariance plot should not be
considered a rigorous or best practice method of determining self-similarity
and the Hurst exponent. There are various other tools, with shortfalls as well,
that are better suited to making an accurate determination.
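As a concrete illustration of the autocorrelation function defined above, the
following Python sketch computes the sample autocorrelation of a finite trace.
The function name `autocorrelation` is hypothetical, and the code uses the
common biased estimator (dividing by the full series length), which is an
assumption rather than something specified in the text.

```python
from typing import List, Sequence


def autocorrelation(x: Sequence[float], max_lag: int) -> List[float]:
    """Sample autocorrelation rho(k) for k = 0 .. max_lag.

    Implements rho(k) = E[(X_t - mu)(X_{t+k} - mu)] / sigma^2 with sample
    averages; sums over the n - k available pairs but divides by n.
    """
    n = len(x)
    mu = sum(x) / n
    var = sum((v - mu) ** 2 for v in x) / n
    rho = []
    for k in range(max_lag + 1):
        cov = sum((x[t] - mu) * (x[t + k] - mu) for t in range(n - k)) / n
        rho.append(cov / var)
    return rho
```

Plotting the result against the lag on log-log axes gives a quick visual check
for power law decay, subject to the caveat above that a regression on such a
plot is not a rigorous estimate of the Hurst exponent.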

Second order similarity/aggregated variance analysis

Second order similarity is defined by taking the original time series and
recreating it for different time "windows" m, where all values of the series
within windows of length m are averaged. For example, the new time steps become
$t = 0, m, 2m, \ldots, N/m$. Second order similarity, also known as aggregated
variance analysis, is formally defined by taking the new time series

$$X^{(m)}_k = \frac{1}{m}\left(X_{km-m+1} + \cdots + X_{km}\right) \tag{6}$$

for all $m = 1, 2, 3, \ldots$. The time series is called exactly self-similar
if the variance is $Var(X^{(m)}) = \sigma^2 / m^{\beta}$ and

$$\rho^{(m)}(k) = \rho(k), \qquad k \geq 0 \tag{7}$$

For a normal independent and identically distributed time series the variance
would behave as $Var(X^{(m)}) \sim \sigma^2 / m$. With self-similarity it decays
much more slowly given the range $0 < \beta < 1$. The time series is called
asymptotically self-similar if the autocorrelation function of the new time
series for large k behaves as

$$\rho^{(m)}(k) \to \rho(k), \qquad m \to \infty \tag{8}$$

For both definitions of self-similarity, the Hurst exponent H can be derived
from the value of $\beta$ according to the equation $H = 1 - \beta/2$. This
confines the Hurst exponent to values between 1/2 and 1 for a self-similar
system. Note that an exponent of H = 1/2 is identical to that of random
Brownian motion and H = 1 reflects complete self-similarity. Most studies
estimate H to be around 0.8 for most types of Internet traffic. The results
from the data trace analyzed by the author in figure 4 give a Hurst exponent of
0.81.
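A minimal sketch of the aggregated variance estimate follows, assuming the
relations $Var(X^{(m)}) \sim \sigma^2 m^{-\beta}$ and $H = 1 - \beta/2$ given
above. The function names and the choice of block sizes are hypothetical, and
the estimate is only a least-squares fit on a log-log plot, so it inherits the
sensitivity to the block size range discussed in this section.

```python
import math
from typing import List, Sequence


def aggregate(x: Sequence[float], m: int) -> List[float]:
    """Average x over non-overlapping blocks of length m (leftover samples are dropped)."""
    n_blocks = len(x) // m
    return [sum(x[k * m:(k + 1) * m]) / m for k in range(n_blocks)]


def variance(x: Sequence[float]) -> float:
    """Population variance of a sequence."""
    mu = sum(x) / len(x)
    return sum((v - mu) ** 2 for v in x) / len(x)


def fit_slope(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Ordinary least-squares slope of ys against xs."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(xs, ys))
    sxx = sum((a - mx) ** 2 for a in xs)
    return sxy / sxx


def hurst_aggregated_variance(x: Sequence[float], block_sizes: Sequence[int]) -> float:
    """Estimate H from the slope of log Var(X^(m)) versus log m."""
    log_m = [math.log(m) for m in block_sizes]
    log_var = [math.log(variance(aggregate(x, m))) for m in block_sizes]
    beta = -fit_slope(log_m, log_var)  # Var(X^(m)) ~ sigma^2 * m^(-beta)
    return 1.0 - beta / 2.0
```

For uncorrelated noise the variance of the block means falls off as 1/m, so the
estimate should come out near H = 1/2; long-range dependent traffic would give
a shallower slope and a larger H.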

One must take care to differentiate two similar but not identical aspects of
Internet traffic: self-similarity, just defined above, and long-range
dependence. Long-range dependence describes a system whose autocorrelation
function is not summable:

$$\sum_{k} \rho(k) = \infty \tag{9}$$

When H > 1/2 for self-similar traffic, long-range dependence is implied, but in
other conditions one can have long-range dependence without self-similarity as
long as equation 5 is satisfied. Long-range dependence is also called
persistence and is contrasted with short-range dependence (SRD), which
manifests in processes where 0 < H < 1/2. LRD also depends on an assumption of
stationarity in traffic, which is reasonable on timescales of minutes to hours
but is less useful over large timescales due to diurnal traffic variations and
long-term trends.

R/S Statistic

Again, we separate the time series into m equal blocks of length N/m, except
that all values in each block are aggregated by simple summation. Define n as
n = N/m and define the range R(n) as the difference between the value of the
largest block and the smallest block. Define S(n) as the standard deviation of
the values of the blocks. The ratio R(n)/S(n) should scale with n such that

$$E[R(n)/S(n)] \sim c_3 n^{H} \tag{10}$$

Note that one problem with both the R/S and other methods such as aggregated
variance is choosing the right range for the sizes of the blocks [35]. Choosing
values of m that are too small makes short term correlations dominate, while a
large m leaves fewer blocks and gives a less accurate estimate of H. One
approach created to deal with this issue is wavelet analysis of the logscale
diagram, which is covered in the next section on wavelet methods.
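The sketch below illustrates an R/S estimate in Python. Note that it implements
the classical rescaled range, where R is the range of the cumulative deviations
from the block mean, which is more detailed than the simplified description
given above; this substitution, the function names, and the block lengths are
all assumptions made for illustration. Blocks with zero variance are not
handled.

```python
import math
from typing import Sequence


def rescaled_range(block: Sequence[float]) -> float:
    """Classical R/S of one block: range of cumulative deviations over the std."""
    n = len(block)
    mu = sum(block) / n
    cum, hi, lo = 0.0, 0.0, 0.0
    for v in block:
        cum += v - mu          # running sum of deviations from the block mean
        hi = max(hi, cum)
        lo = min(lo, cum)
    std = math.sqrt(sum((v - mu) ** 2 for v in block) / n)
    return (hi - lo) / std


def hurst_rs(x: Sequence[float], block_lengths: Sequence[int]) -> float:
    """Estimate H as the slope of log(mean R/S) versus log(block length n)."""
    log_n, log_rs = [], []
    for n in block_lengths:
        blocks = [x[i:i + n] for i in range(0, len(x) - n + 1, n)]
        mean_rs = sum(rescaled_range(b) for b in blocks) / len(blocks)
        log_n.append(math.log(n))
        log_rs.append(math.log(mean_rs))
    mx, my = sum(log_n) / len(log_n), sum(log_rs) / len(log_rs)
    sxy = sum((a - mx) * (b - my) for a, b in zip(log_n, log_rs))
    sxx = sum((a - mx) ** 2 for a in log_n)
    return sxy / sxx
```

On short blocks the R/S slope is known to be biased, so the result is best read
alongside the other estimators rather than on its own.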

Periodogram

An additional test for long-range dependence is the presence of 1/f noise in
the spectral density of the time signal at low frequencies. The exponent of the
1/f noise is related to $\beta$ as well, where $f(\lambda) = c\lambda^{-\gamma}$,
c is a constant (unrelated to previous ones), $\lambda$ is the frequency,
$0 < \gamma < 1$, and $\gamma = 1 - \beta$.

Often, the spectral density $f(\lambda)$ is estimated as

$$f(\lambda) \approx \frac{1}{2\pi N}\left|\sum_{j=1}^{N} X_j e^{ij\lambda}\right|^2 \tag{11}$$

whose log-log plot slope should be close to 1 - 2H near the origin.
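A small Python sketch of the periodogram estimate is given below. It evaluates
the periodogram at the Fourier frequencies $\lambda_k = 2\pi k/N$ with a direct
sum (no FFT), subtracts the sample mean first, and fits the low-frequency
log-log slope, assumed here to be approximately 1 - 2H as stated above. The
function names and the cutoff `k_max` are hypothetical choices.

```python
import cmath
import math
from typing import List, Sequence, Tuple


def periodogram(x: Sequence[float], k_max: int) -> List[Tuple[float, float]]:
    """Return (lambda_k, I(lambda_k)) for k = 1 .. k_max, lambda_k = 2*pi*k/N."""
    n = len(x)
    mu = sum(x) / n
    result = []
    for k in range(1, k_max + 1):
        lam = 2.0 * math.pi * k / n
        s = sum((x[t] - mu) * cmath.exp(-1j * lam * t) for t in range(n))
        result.append((lam, abs(s) ** 2 / (2.0 * math.pi * n)))
    return result


def hurst_from_periodogram(x: Sequence[float], k_max: int) -> float:
    """Fit log I(lambda) against log lambda at low frequencies; slope ~ 1 - 2H."""
    pts = periodogram(x, k_max)
    xs = [math.log(lam) for lam, _ in pts]
    ys = [math.log(val) for _, val in pts]
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    slope = sum((a - mx) * (b - my) for a, b in zip(xs, ys)) / sum((a - mx) ** 2 for a in xs)
    return (1.0 - slope) / 2.0
```

The direct sum costs on the order of N times `k_max` operations, which is
acceptable for short traces; a fast Fourier transform would be the usual choice
for longer ones.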
Scaling of Moments

[167] use the fact that the moments scale with the length of the time series to
identify self-similarity. Define the absolute moment as

$$\mu^{(m)}(q) = E\left|X^{(m)}\right|^q = E\left|\frac{1}{m}\sum_{i=1}^{m} X_i\right|^q \tag{12}$$

The absolute moment $\mu^{(m)}(q)$ scales as

$$\log \mu^{(m)}(q) \sim \beta(q) \log m + C(q) \tag{13}$$

where $\beta(q) = q(H - 1)$.

Table 5. Relationship among key exponents

$H = 1 - \beta/2$          $\beta = 2(1 - H)$
$\gamma = 1 - \beta$       $\beta = 1 - \gamma$
$\gamma = 2H - 1$          $H = (\gamma + 1)/2$



An excellent guide to measuring the Hurst parameter can be found in pf 
Though the Hurst exponent is well-defined mathematically, in practice all mea- 
surements of it are only estimations and different techniques, software, or noisy 
data sets can produce varying estimates. Even on artificially generated data
with a known Hurst exponent, the different techniques gave divergent estimates
of the Hurst exponent, and the R/S statistic performed poorly, underestimating
H on both generated and real data.

Many may realize that in all of this discussion of self-similarity and fractals the 
fractal dimension has not been mentioned once. The omission is purposeful and due 
to the convention that almost without exception, the Hurst exponent is used as 
the measure of self-similarity in data traffic research. In any case, the conversion is 
not difficult since the fractal dimension D of the time series is related to the Hurst 
exponent by 

$$D = 2 - H \tag{14}$$

Given equation 14, we can see that the typical fractal dimension of data
traffic is around 1.2.

Nonstationary data methods 

As noted earlier, the previous techniques for measuring self-similarity
implicitly assume a stationary signal. This is definitely not the case in
Internet traffic, where on longer time scales periodicities such as daily usage
patterns and growth in traffic over time make the traffic data nonstationary.

A commonly used method of measuring long-range correlations and
self-similarity in nonstationary time series traffic is detrended fluctuation
analysis (DFA), an approach adopted independently in several papers
(|l03Hl58t Il98l ). DFA was first used to measure the long-range correlations
in non-coding regions of DNA and is often used to measure correlations among
fluctuations in physiological or financial time series. In short, DFA is a
modified root-mean-square (RMS) analysis that measures the fluctuation of a
time series around its local trend and thereby its long-range correlation. To
use DFA for a time series X(t) of length N, first calculate the profile y(t)
given by

$$y(t) = \sum_{i=1}^{t} \left[X(i) - \langle X \rangle\right] \tag{15}$$

where

$$\langle X \rangle = \frac{1}{N}\sum_{i=1}^{N} X(i) \tag{16}$$


The next step involves separating the signal into m equal-sized non-overlapping
segments. In each segment, use least squares regression to find the local
linear trend $\tilde{y}_t$ and then calculate the detrended profile of the
signal, $y_m(t)$, where

$$y_m(t) = y(t) - \tilde{y}_t \tag{17}$$

and finally the detrended RMS is calculated as

$$F(m) = \sqrt{\frac{1}{N}\sum_{i=1}^{N} \left[y_m(i)\right]^2} \tag{18}$$

If the signal has long-range dependence from a 1/f spectrum, F(m) will scale
with m as

$$F(m) \sim m^{\alpha} \tag{19}$$

$\alpha$ is related to the 1/f exponent of the signal by $\gamma = 2\alpha - 1$,
which superficially makes it identical to the Hurst exponent. Given the RMS,
the measured $\alpha$ is a second order measurement of the power law scaling.
The Hurst exponent is most simply extracted by taking the mean value of
$\alpha$.
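The DFA steps can be sketched in Python as follows. This follows the common
formulation in which the scale parameter is the segment length rather than the
number of segments, which is an assumption about how the description above maps
to code; the function names and the list of segment lengths are hypothetical.

```python
import math
from typing import Sequence


def dfa_fluctuation(x: Sequence[float], seg_len: int) -> float:
    """Detrended RMS fluctuation F for non-overlapping segments of length seg_len."""
    n = len(x)
    mu = sum(x) / n
    profile, total = [], 0.0
    for v in x:                      # profile y(t): cumulative sum of deviations
        total += v - mu
        profile.append(total)
    n_seg = n // seg_len             # leftover samples at the end are ignored
    t_mean = (seg_len - 1) / 2.0
    t_var = sum((t - t_mean) ** 2 for t in range(seg_len))
    sq_sum = 0.0
    for s in range(n_seg):
        seg = profile[s * seg_len:(s + 1) * seg_len]
        y_mean = sum(seg) / seg_len
        slope = sum((t - t_mean) * (y - y_mean) for t, y in enumerate(seg)) / t_var
        for t, y in enumerate(seg):  # residual from the local least-squares line
            resid = y - (y_mean + slope * (t - t_mean))
            sq_sum += resid * resid
    return math.sqrt(sq_sum / (n_seg * seg_len))


def dfa_exponent(x: Sequence[float], seg_lengths: Sequence[int]) -> float:
    """Scaling exponent alpha: slope of log F versus log segment length."""
    xs = [math.log(s) for s in seg_lengths]
    ys = [math.log(dfa_fluctuation(x, s)) for s in seg_lengths]
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    return sum((a - mx) * (b - my) for a, b in zip(xs, ys)) / sum((a - mx) ** 2 for a in xs)
```

For uncorrelated noise the exponent should be close to 0.5; values well above
0.5 would indicate persistent, long-range correlated fluctuations.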

DFA is not the only method for analyzing nonstationary time series and, as some
recent studies conclude, is neither always accurate nor optimal for analyzing
nonstationary trended processes. In [14] it is mathematically shown that for
trended processes DFA estimates of the Hurst exponent do not converge to an
accurate value, and the wavelet method (next section) is recommended instead
for measuring the Hurst exponent. Therefore, estimates of H from papers using
the DFA method should be viewed with skepticism and tested against other
methods.

5.3. Wavelet methods



The final methods usually used to measure self-similarity are wavelet methods,
which are often the preferred choice for nonstationary data. In
(JSOI : 1^; l58|; |73|; Il69l ), the logscale diagram method of interpreting
Internet traffic is described. A logscale diagram is created using discrete
wavelet analysis of the signal, where the signal X(t) is filtered through a
wavelet defined, for a timescale j and time instant k, as

$$\psi_{j,k}(t) = 2^{-j/2}\,\psi(2^{-j}t - k) \tag{20}$$

A typical wavelet used in these discrete wavelet analyses is the Haar wavelet.
Applying the wavelet transform, the signal can be represented as

$$X(t) = \sum_{k} c_X(j_0,k)\,\phi_{j_0,k}(t) + \sum_{j \leq j_0}\sum_{k} d_X(j,k)\,\psi_{j,k}(t) \tag{21}$$

where $c_X(j_0,k)$ are called the scaling coefficients, $\phi_{j_0,k}$ is
called the scaling function, and $d_X(j,k)$ are called the wavelet
coefficients. Wavelet theory will not be covered in great detail here due to
the complexity of the subject; however, there are several useful guides
(|l43l : Il28t ISGJ ) to the subject. Each scale increment j represents a
scaling of the timescale of order $2^j$, and j is commonly termed the octave.
In addition, in [2] it is shown for a stationary, self-similar process that the
expectation of the energy $E_j$ that lies within a bandwidth $2^{-j}$ around
the frequency $2^{-j}\lambda_0$, where $\lambda_0$ is the sampling frequency,
is



$$E[E_j] = E\left[\frac{1}{N_j}\sum_{k} |d_X(j,k)|^2\right] \tag{22}$$

where $d_X(j,k)$ are the wavelet coefficients of octave j and $N_j$ is the
number of wavelet coefficients in octave j. Graphing the log of $E[E_j]$ versus
the octave j gives a logscale diagram, an example of which is the bottom graph
in figure 4. In addition, $E[E_j]$ is also related to the sampling frequency
and the Hurst exponent:

$$E[E_j] = c\,|2^{-j}\lambda_0|^{1-2H} \tag{23}$$

So the logarithm of the expected energy varies linearly with the octave j, with
a slope set by the Hurst exponent: $\log_2 E[E_j] = (2H - 1)j + \text{const}$.
In fact, monofractal behavior is indicated by a linear dependence of
$\log E[E_j]$ on j over multiple octaves. The different scaling regimes can be
seen in figure 4 by noting how the curve varies over the octaves 2 to 4 and 8
to 12.
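A logscale diagram can be sketched with a hand-written Haar transform, as in
the Python example below. It computes the average squared Haar detail
coefficient per octave and converts the fitted slope of $\log_2 E_j$ versus j
into H using the slope relation 2H - 1 from equation 23. The function names are
hypothetical, the input length is assumed to be at least $2^J$ for J octaves,
and no confidence intervals or bias corrections are included.

```python
import math
from typing import List, Sequence


def haar_energies(x: Sequence[float], n_octaves: int) -> List[float]:
    """Mean squared Haar wavelet coefficient for octaves j = 1 .. n_octaves."""
    approx = list(x)
    energies = []
    root2 = math.sqrt(2.0)
    for _ in range(n_octaves):
        pairs = len(approx) // 2
        detail = [(approx[2 * k] - approx[2 * k + 1]) / root2 for k in range(pairs)]
        approx = [(approx[2 * k] + approx[2 * k + 1]) / root2 for k in range(pairs)]
        energies.append(sum(d * d for d in detail) / pairs)
    return energies


def hurst_from_logscale(energies: Sequence[float]) -> float:
    """Fit log2(E_j) against j = 1, 2, ...; the slope is taken to be 2H - 1."""
    js = list(range(1, len(energies) + 1))
    ys = [math.log2(e) for e in energies]
    mj, my = sum(js) / len(js), sum(ys) / len(ys)
    slope = sum((j - mj) * (y - my) for j, y in zip(js, ys)) / sum((j - mj) ** 2 for j in js)
    return (slope + 1.0) / 2.0
```

For uncorrelated noise the energies are roughly flat across octaves, giving H
near 0.5; in practice the fit would be restricted to the octaves where the
diagram is visibly linear.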

The second, and often considered more rigorous, method of looking at changing 
self-similarity using wavelets is looking at the scaling of the partition function for 
each moment of order q over each octave where the partition function is defined as 






$$S(q,j) = \sum_{k} \left|2^{-j/2}\,d_X(j,k)\right|^q \tag{24}$$

The scaling behavior, besides being seen by graphing $\log_2 S(q,j)$ vs. j, is
encapsulated using what is called the structure function

$$\tau(q) = \lim_{j \to -\infty} \frac{\log S(q,j)}{j \log 2} \tag{25}$$

If the traffic is exactly self-similar with Hurst exponent H, then for each q,
$\tau(q) = Hq - 1$. When more than one scaling behavior is present in the
signal, $\tau(q)$ is no longer linear but concave, and each scaling exponent
contributes to its value roughly according to its relative strength in the
signal at the relevant timescale. For details see [59; 169].



There have been some theories of multifractal behavior in Internet traffic by
traffic engineers. This thesis was first broached and analyzed in [145]. In
short, it is believed that self-similarity is generated on long timescales by
user and application needs (see next section on the ON/OFF model) but on
shorter timescales the picture is much less clear. It is possible that a
different self-similarity affected by network or TCP congestion control
considerations is active in this region. Early papers tried to explain
multifractal properties of traffic by a process known as multiplicative
cascades or conservative cascades [59; 73; 71]. The cascade is mathematically
defined as a mass M equally distributed over the interval (0, 1], where the
mass is broken up into two new masses, one with mass fraction p and the other
with mass fraction 1 - p, where p is a fraction defined for the process, and
these two new pieces are broken up again according to the same process ad
infinitum. The multiplicative cascade model was rationalized in relation to
Internet traffic by describing the encapsulation of flows into packets and the
fragmentation of these packets at the link layer as a conservative cascade
process, where the total transmitted data is conserved but broken down into
many different packets. Since this process occurs over relatively short time
scales, it is given as additional evidence for the cause behind the different
scaling at shorter time scales.
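The cascade construction can be illustrated with a short Python sketch. It
assumes a binomial cascade in which each interval is halved and the fractions p
and 1 - p are assigned to the two halves in random order; the random ordering,
the function name, and the parameters are illustrative assumptions rather than
details taken from the cited papers.

```python
import random
from typing import List, Optional


def conservative_cascade(total_mass: float, p: float, levels: int,
                         rng: Optional[random.Random] = None) -> List[float]:
    """Split total_mass over 2**levels equal subintervals of (0, 1].

    At each level every piece is split into fractions p and 1 - p, assigned
    to the left and right halves in random order, so total mass is conserved.
    """
    rng = rng or random.Random()
    masses = [total_mass]
    for _ in range(levels):
        next_masses = []
        for m in masses:
            left, right = m * p, m * (1.0 - p)
            if rng.random() < 0.5:       # randomly choose which half gets p
                left, right = right, left
            next_masses.extend((left, right))
        masses = next_masses
    return masses
```

After ten levels with p = 0.7, the largest of the 1024 pieces carries a
fraction 0.7^10 of the mass while the total is unchanged, which shows how
quickly the mass becomes concentrated on a few subintervals.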



Many questions about the multifractal paradigm, however, were raised in [170],
which openly criticized some of these claims and questioned whether
multifractal models are as necessary and as proven as they are purported to be.
In particular, Veitch, Hohn, & Abry, while analyzing some of the most common
data traces used in Internet traffic studies, raise questions about the rigor
of the statistical tests used, such as logscale diagrams, especially given
large confidence intervals for some values of the energy at higher octaves.
They also raise the point that these tests rely on an assumption of
stationarity in Internet traffic, which may not always be valid, especially
over longer timescales. In the end, they do not completely rule out
multifractals; however, they raise the point that current statistical tools are
not yet developed enough to give a definite answer on the existence of
multifractals. Similar comments are made in [183], declaring that multifractal
patterns may exist but are not to be seen as an end in themselves and that any
new model of multifractality must be matched with a feasible mechanism.

Self-similarity and long-range dependence account for the "bursty" behavior of
Internet traffic at all time scales. In telephone traffic, which is Poisson,
large spikes are rare deviations from a mean traffic level and have an
exponentially decreasing probability. In contrast, burstiness in Internet
traffic has a non-vanishing probability at almost all scales. This makes
traffic management schemes and infrastructure planning much more difficult from
a statistical standpoint. Sometimes, a scheme known as "small buffers, high
bandwidth" [134] is advanced to deal with bursty traffic and avoid trying to
create massive buffers to handle bursts of traffic. However, there is not yet
an easy answer to managing Internet traffic, especially one with practical use.

Fig. 4. A view of a 10,192 second trace of IPv6 6Bone experimental network Internet traffic
taken from the WIDE MAWI traffic trace archive Samplepoint-C on July 22, 2005. The data was
collected into 1 s intervals. The following figures show the packets/s of traffic, the distribution of
packet arrivals, the autocorrelation of the time series up to a lag of 1000, the 1/f noise plot of the
data trace, and the logscale diagram constructed from wavelet coefficient data based on 100 ms
bins of packet arrivals and 95% confidence intervals. The Hurst exponent was calculated with the
statistical program R with the fractal package using the aggregated variance method, estimating
H = 0.81.

6. Theories on the causes of self-similar traffic

Once self-similarity was demonstrated in Internet traffic, the next logical step
was to look for the cause. This is where most of the tension between the
different fields has developed and sometimes clashed. Though some consider the
question controversial and unresolved, there are some models which currently
have a greater weight of evidence behind them as checked against real-world
traffic. There are three main theories which will be discussed at length
throughout the rest of the paper.

6.1. Application Layer Cause: Long-tail ON/OFF sources 

First, there is the most widely known, and only empirically validated,
application layer theory, which states that self-similar traffic is caused by
the behavior of users. This was first elaborated in [179; 180]. This theory
models the traffic on the Internet as a large number of ON/OFF sources with
identical duration distributions, an idea earlier broached by Mandelbrot [110].
The ON/OFF sources, which reflect flows, are superimposed traffic sources that
alternate between ON and OFF periods whose durations follow a power-law
distribution. Though this is not usually explicitly mentioned, this model is
extremely close to modeling a large number of flows with long tails dominated
by a few elephant flows among many mice, as found in actual traffic
measurements. In [179; 180] they give evidence both from theory and observation
that many superimposed ON/OFF sources behave in the limit as fractional
Brownian motion and can account for the self-similarity seen in overall
Internet traffic. Specifically, the Hurst exponent of the traffic of the
superimposed sources is calculated by

$$H = \frac{3 - \min(\alpha_{ON}, \alpha_{OFF})}{2} \tag{26}$$

where $\alpha_{ON}$ and $\alpha_{OFF}$ are the power law exponents of the ON
and OFF duration distributions respectively. ON/OFF flows with long-tail
duration are not merely a theoretical abstraction but have a basis in the
access of files over the Internet. Internet file sizes have been shown by
several studies [12; 44; 74; 116] to be at least long-tailed, though there is
dispute over whether the distribution more closely fits a power law,
lognormal, or double Pareto.



Crovella & Bestavros and Crovella, Park & Kim [36; 38; 135] extend the model to
explain the self-similar nature of TCP and web traffic. They base their studies
on long-tailed file sizes with a power-law exponent of about 1.2 (implying a
Hurst exponent of about 0.9 from equation 26). They also found that in general,
shortening the tail of the file length distribution by increasing the
power-law exponent lowers the Hurst exponent, as expected in [179; 180]. This
theory of ON/OFF sources, and its variants, has become the dominant
explanation for self-similar network traffic in most network engineering
papers. It is also a frequently used model for simulating data traffic in
Internet simulations.
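The ON/OFF picture can be sketched in Python as follows: one function evaluates
equation 26, and the others generate a superposition of sources with
Pareto-distributed ON and OFF durations. Everything beyond equation 26 is an
illustrative assumption, including the function names, the unit minimum
duration, the one-unit-per-step sending rate, and the random initial state.

```python
import math
import random
from typing import List


def hurst_from_onoff(alpha_on: float, alpha_off: float) -> float:
    """Equation 26: H = (3 - min(alpha_ON, alpha_OFF)) / 2."""
    return (3.0 - min(alpha_on, alpha_off)) / 2.0


def pareto_duration(alpha: float, rng: random.Random, x_min: float = 1.0) -> float:
    """Draw a Pareto(alpha) duration by inverse-CDF sampling."""
    u = 1.0 - rng.random()            # uniform on (0, 1], never exactly zero
    return x_min * u ** (-1.0 / alpha)


def onoff_source(n_steps: int, alpha_on: float, alpha_off: float,
                 rng: random.Random) -> List[int]:
    """One source: 1 while ON, 0 while OFF, with heavy-tailed period lengths."""
    series: List[int] = []
    is_on = rng.random() < 0.5
    while len(series) < n_steps:
        alpha = alpha_on if is_on else alpha_off
        raw = math.ceil(pareto_duration(alpha, rng))
        duration = min(raw, n_steps - len(series))   # cap very long periods
        series.extend([1 if is_on else 0] * duration)
        is_on = not is_on
    return series


def aggregate_traffic(n_sources: int, n_steps: int, alpha_on: float,
                      alpha_off: float, seed: int = 0) -> List[int]:
    """Superimpose independent ON/OFF sources into one traffic series."""
    rng = random.Random(seed)
    total = [0] * n_steps
    for _ in range(n_sources):
        for t, v in enumerate(onoff_source(n_steps, alpha_on, alpha_off, rng)):
            total[t] += v
    return total
```

With both exponents set to 1.2, equation 26 gives H = 0.9, matching the value
quoted above; the simulated aggregate could then be passed to any of the Hurst
estimators discussed in section 5.2 as a rough check.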

The second theory, discussed in more detail in the next section, considers
origins of the self-similar traffic at the transport layer. In particular, it
looks at possible effects the TCP congestion control algorithms may have on
network traffic given the feedback and collective behavior they can engender
among multiple traffic sources over the same path. The third main theory,
discussed in the section on phase transition models of traffic congestion,
connects self-similarity to critical phenomena in data traffic near the
transition point from free flow to congested traffic. This is the model
currently most favored amongst physicists and underlies many of the models of
data traffic that will be described later. In addition to describing these
models, the insights and flaws of each will be described in detail.

7. TCP Throughput and Congestion Control Phenomena 

As stated earlier, TCP makes up by far the bulk of Internet traffic. Therefore,
any discussion of Internet traffic is by and large a discussion of TCP/IP
traffic. TCP is a connection based protocol and relies on several programmed
algorithms to manage and guarantee the delivery of packet traffic. However, TCP
was developed when the Internet was relatively small. Though it is still useful
and efficient, the large-scale macroscopic effects of its operation were not
easily predictable and were only measured or derived later.

Volumes of articles have been written on TCP behavior, possible algorithmic
improvements, and traffic management. TCP has several features, including
buffering and congestion control, that make it one of the only Internet
protocols that uses feedback to adjust protocol performance. Nonlinear effects
combined with feedback are well known to produce complex systems phenomena, and
TCP is no exception. In this section, TCP's basic mechanisms will be defined
and explained and then linked with the most common theories of network
performance and congestion.

7.1. A short explanation of TCP 

There are many goo d gu ides on TCP, but most information in this paper is taken 



from an IBM guide (|l37l ). TCP relies on several key features which are necessary 
to ensure reliable and smooth delivery of packets between the source and receiver. 
The TCP connection starts out with a "three-way handshake" which consists of 
one SYN (synchronize) and two ACK (acknowledge) packets. TCP flows are based 
on the concepts of windows and flow control. When packets are transmitted they 
are given sequence numbers to determine the correct order of data transmission. 
The source then waits to receive and ACK packets before transmitting additional 
packets. 

The number of packets a source can transmit before needing to receive at least
one ACK packet is the window. When a TCP connection is initiated and as it
continues, the receiver sends an ACK packet which lets the source know the highest
sequence number it is able to accept given buffer memory and system constraints.
The source then sends the number of packets to fit that window and waits for an
ACK. For every packet confirmed, an additional one is sent and the window size is
maintained. The window size can be changed by the receiver in every ACK packet
by varying the highest sequence number it can receive so the window often varies
over the course of a transmission. If an ACK for a packet is not received within a
timeout period, TCP considers the unacknowledged packets lost and retransmits them.
Because of the possible need for retransmission, TCP must buffer all data that
has been sent but not yet acknowledged. The size of this buffer at the sender is
usually calculated from the bandwidth-delay product, which is the product of the
link bandwidth and the round-trip time (RTT). Therefore, for high bandwidth links
or long RTT links, the buffer can become large and burdensome on the operating
system.
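
As a rough illustration of the bandwidth-delay product, the following minimal sketch computes the amount of unacknowledged data a sender may need to buffer. The function name and the two example links are hypothetical and chosen only for illustration; they are not taken from the cited guide.

```python
def bandwidth_delay_product_bytes(bandwidth_bps: float, rtt_s: float) -> float:
    """Bytes that can be in flight (sent but unacknowledged) on a fully used link."""
    return bandwidth_bps * rtt_s / 8.0  # convert bits to bytes


# Hypothetical links, chosen only for illustration.
examples = {
    "10 Mbps, 20 ms RTT": (10e6, 0.020),
    "1 Gbps, 100 ms RTT": (1e9, 0.100),
}
for name, (bandwidth, rtt) in examples.items():
    kilobytes = bandwidth_delay_product_bytes(bandwidth, rtt) / 1e3
    print(f"{name}: about {kilobytes:.0f} kB of send buffer")
```

The second example shows why long, fast paths are demanding: the required buffer grows in proportion to both the bandwidth and the RTT.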

7.2. Congestion control 

Congestion control is one of the most prominent and differentiating aspects of TCP
as compared to UDP and most other transport-level protocols. Flow control uses
ACK feedback to coordinate smooth transmission between the sender and receiver,
while congestion control uses feedback from the network throughput environment to
adjust the sender rate in order not to cause network congestion. Thus Internet
traffic, which is largely TCP/IP, behaves in part as a massive closed-loop feedback
system in which the transmission throughputs of multiple senders are modified by
the measured traffic congestion environment. This has led to much research
on self-organization in TCP traffic, which will be discussed later.

The congestion control algorithm is not uniform across all TCP software
implementations and has various flavors named after resort destinations in Nevada
including Vegas, Reno, and Tahoe. All implementations, though, essentially share
four general features: slow start, congestion avoidance, fast recovery (not in
Tahoe), and fast retransmit.

Slow start addresses the inherent problem that, regardless of what TCP
window the receiver advertises in its ACK packets, the network may still be so slow
or congested at points that it cannot handle that many packets transmitted
over such a short time. Slow start handles this by overlaying another window over
the TCP window called the congestion window (cwnd). At first, this window starts
at one packet and tests to see if an ACK is received. If so, the congestion window
grows to two, waits for two ACKs, and then grows to four, doubling
at each successful step. The sender always uses the smaller of the congestion
window and the window advertised by the receiver as the TCP window. This ensures
that if the receiver's window is too large for current network performance, the
congestion window will compensate. Slow start has a threshold, though, which is the
largest congestion window size up to which this exponential growth is applied.

Congestion avoidance works in tandem with slow start. Congestion avoidance
assumes that any packet loss (packet loss is normally assumed to be much less than
1%) signals network congestion. The connection detects packet loss by a timeout
or by duplicate ACK packets. If congestion is detected, congestion avoidance slows
down the TCP connection by setting the slow start threshold to one half of the
current congestion window, the so-called exponential backoff. If the loss was
detected by a timeout, the congestion window is "reset" down to one packet and slow
start repeats. Slow start increases the window size up to the slow start threshold
and then congestion avoidance takes control of the congestion window size. Instead
of increasing the window size in an exponential manner, congestion avoidance
increases it in increments with every ACK received according to the following
equation

\[
\frac{\text{segment size}}{\text{congestion window size}} \times \text{segment size} \qquad (27)
\]

where the segment size is the size in bytes of data TCP fits into each packet.
Therefore the congestion window increases linearly, by roughly one segment per
round trip, while congestion avoidance is in control.
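
The linear growth can be made explicit with a short worked step. This is a sketch under two simplifying assumptions that are not stated in the text: every segment is acknowledged individually, and no loss occurs during the round trip. With congestion window W and segment size S (both in bytes), about W/S ACKs arrive per RTT and each adds the increment of equation (27), so

\[
\Delta W_{\text{per RTT}} \approx \frac{W}{S} \times \frac{S \times S}{W} = S
\]

That is, the window grows by about one segment per round trip regardless of its current size, in contrast to the doubling per round trip seen in slow start.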

Fast retransmit is where TCP uses the number of duplicate ACKs received
to determine whether a packet was received out of order at the receiver or likely
dropped. If three or more duplicate ACKs are received it assumes the packet was
lost and retransmits it. This prevents TCP from having to wait the entire timeout
period before retransmitting. Since fast retransmit is based on the assumption of a
lost packet, congestion avoidance comes into play. However, fast recovery takes over
when fast retransmit is used so that the congestion window does not drop all the
way to one and restart using slow start. Instead, the threshold is set
to one half of the congestion window size and the congestion window restarts
at the threshold size + 3 × segment size. The congestion window then increases by
one segment for each additional duplicate ACK.
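
The four mechanisms above can be tied together in a small state-machine sketch. This is a deliberately simplified illustration, not any particular TCP implementation: the window is counted in segments, the event names are hypothetical, the floor of two segments on the threshold is an assumption borrowed from common practice, and the temporary window inflation during fast recovery is reduced to its entry step.

```python
def reno_cwnd_trace(events, ssthresh=16.0):
    """Return the congestion window (in segments) after each event.

    events: iterable of "ack", "timeout", or "3dup" (three duplicate ACKs).
    Simplified, Reno-like behavior for illustration only.
    """
    cwnd = 1.0
    trace = []
    for event in events:
        if event == "ack":
            if cwnd < ssthresh:
                cwnd += 1.0         # slow start: +1 segment per ACK (doubles per RTT)
            else:
                cwnd += 1.0 / cwnd  # congestion avoidance: equation (27) in segment units
        elif event == "timeout":
            ssthresh = max(cwnd / 2.0, 2.0)  # assumed floor of two segments
            cwnd = 1.0                       # restart from slow start
        elif event == "3dup":
            ssthresh = max(cwnd / 2.0, 2.0)
            cwnd = ssthresh + 3.0            # fast retransmit, then fast recovery entry
        else:
            raise ValueError(f"unknown event: {event}")
        trace.append(cwnd)
    return trace


print(reno_cwnd_trace(["ack", "ack", "ack", "3dup", "ack", "timeout"], ssthresh=4.0))
```

In this example the window climbs through slow start, is reset to the halved threshold plus three segments on the triple duplicate ACK, creeps up by a fraction of a segment in congestion avoidance, and collapses to one segment on the timeout.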

7.3. TCP macroscopic behavior 

The intricacies of the operation of TCP have led to much research characterizing
the protocol's average or expected performance and its influence on the overall
traffic patterns of the Internet. One of the best-known and most widely reported
results is an equation for the maximum possible throughput of a TCP connection,
developed in early versions by Floyd (62) and Lakshman & Madhow (98) and in its
most widely known version by Mathis, Semke, Mahdavi, and Ott (111). They explore the
expected performance of TCP against a background of random, but constant
probability, packet loss given the window resizing by the congestion avoidance
backoff mechanism. Their result (often called the SQRT model) is for the
theoretical maximum throughput of a TCP connection and is given by

\[
T \le \frac{MSS}{RTT} \, \frac{C}{\sqrt{p}} \qquad (28)
\]

where MSS is the maximum segment size, typically defined by the operating
system for TCP and usually 1460 bytes (144), p is the packet loss probability, and C
is a constant which varies based on assumptions of periodic or random loss
and the handling of ACKs by the congestion avoidance algorithm. Since C is usually
less than 1 the equation can be simplified to

\[
T \le \frac{MSS}{RTT} \, \frac{1}{\sqrt{p}} \qquad (29)
\]

This equation assumes packet loss is handled by congestion avoidance and
detected by receiving duplicate ACK packets, not via timeouts. Though
this is the most famous and widely used equation, a more accurate one, especially
when p > 0.02, was introduced by Padhye, Firoiu, Towsley, and Kurose (131). This
equation, based on a version of TCP Reno, also incorporates packet loss due to
timeouts, which is more realistic for higher packet loss situations. Their equation
yields an approximation for throughput where



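\[
T \approx \min\left( \frac{W_m}{RTT},\; \frac{1}{RTT \sqrt{\frac{2bp}{3}} + t_0 \min\left(1,\, 3\sqrt{\frac{3bp}{8}}\right) p \left(1 + 32p^2\right)} \right) \qquad (30)
\]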



This equation is also known as the PFTK equation. Here p is once again packet
loss, W_m is the maximum window size advertised by the receiver, b is the number of
packets acknowledged by each ACK (usually 2), and t_0 is the initial timeout value.
In (131) the authors compared the fit of their equation and of the SQRT equation
to real data and state that PFTK fits better. A more complete analysis and
comparison was conducted by El Khayat, Geurts, and Leduc (50). They note both
equations neglect slow start, which makes them inappropriate for very short TCP
flows, and that the equations also neglect fast recovery. To test the models they
generated thousands of random networks with random graph topologies where the
number of nodes (10 - 600) was chosen at random and the bandwidth (56 kbps - 100
Mbps) and the delay (0.1 ms - 500 ms) were chosen randomly for each link. They then
tested the TCP throughput on these virtual networks versus both equations and made
comparisons using the mean squared error, R^2, the over/under estimation ratio of
average calculated throughput to actual throughput, and an absolute ratio which
takes the larger of the first ratio or its inverse. For all metrics, the PFTK
equation performed better but was still a poor predictor of actual TCP performance,
giving incorrect estimates roughly 70% of the time.
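
To make the two throughput models concrete, the sketch below evaluates equations (28) to (30) numerically. It assumes p is a loss probability between 0 and 1, that equation (30) yields segments per second (so it is multiplied by the MSS to obtain bits per second), and that C defaults to 1 as in equation (29). The function names and the example parameter values are hypothetical.

```python
import math


def sqrt_model_bps(mss_bytes, rtt_s, p, c=1.0):
    """Upper bound of equation (28) in bits per second; c=1 gives equation (29)."""
    return 8.0 * mss_bytes / rtt_s * c / math.sqrt(p)


def pftk_model_bps(mss_bytes, rtt_s, p, w_max_segments, t0_s, b=2):
    """Approximation of equation (30), converted from segments/s to bits/s."""
    window_limit = w_max_segments / rtt_s
    denominator = (
        rtt_s * math.sqrt(2.0 * b * p / 3.0)
        + t0_s * min(1.0, 3.0 * math.sqrt(3.0 * b * p / 8.0)) * p * (1.0 + 32.0 * p ** 2)
    )
    return 8.0 * mss_bytes * min(window_limit, 1.0 / denominator)


# Hypothetical connection: 1460-byte MSS, 100 ms RTT, 1% loss, 64-segment window, 1 s timeout.
print(sqrt_model_bps(1460, 0.1, 0.01))
print(pftk_model_bps(1460, 0.1, 0.01, w_max_segments=64, t0_s=1.0))
```

For these illustrative values the SQRT bound is roughly 1.17 Mbps while the PFTK estimate is roughly 0.83 Mbps, reflecting the extra cost attributed to timeouts.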

7.4. Transport Layer Cause: TCP Congestion Control 

As discussed earlier, there has been interest in whether self-similar traffic can find
its causes in the congestion control of TCP rather than at the application level. This
has not been a generally accepted or verified cause of or contributor to
self-similarity, at least on long time-scales. Veres and Boda (171) first brought up
the conjecture that since the assumption of TCP stochasticity or predictable
periodicity in these equations is highly flawed, TCP throughput cannot be reduced to
a closed form equation, and that TCP instead exhibits deterministic and chaotic
behavior. In addition, most analyses look at TCP only at the single link level
instead of treating it as a network dependent entity given congestion control. They
based their papers on simulations in which self-similarity, sensitivity to initial
conditions, strange attractors, and stable periodic orbits appeared. Fekete and
Vattay (57) also used simulations to show that the interaction of TCP with buffers
in routers can cause chaotic behavior in the TCP flows. They simulated the
interaction of N different TCP flows with a buffer that had the capacity to hold a
fixed number of packets. They show that the backoff algorithm of TCP triggered by
lost packets can cause power-law behavior in packet interarrival times and chaotic
dynamics provided the (buffer length)/(# of TCP packets) ratio is below a critical
value of 3.



Similarly, Haga et al. (77) used simulations to recreate self-similar traffic and
long-range dependence by the interactions of multiple TCP flows at a buffer on a
central router connecting three different hosts. They assume an effective packet
loss rate (real packet losses and RTT exceeding the allowable timeout period) but
have an infinite sized router buffer so there is no real packet loss but a large effective
packet loss due to timeouts. This model can produce self-similar traffic with H
= 0.89 without any assumptions of ON/OFF distributions or file sizes beyond a
constant TCP flow size of 1000 packets. The only extra assumption is stochastic
source and destination of TCP flows among the three hosts.
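
For readers who want to see how a Hurst exponent such as the H quoted above can be estimated from a packet-count series, the following minimal sketch uses the aggregated-variance method. This is one common estimator and is an assumption here; it is not claimed to be the method used in (77). The function name and the synthetic white-noise input are hypothetical.

```python
import math
import random
import statistics


def hurst_aggregated_variance(series, block_sizes=(1, 2, 4, 8, 16, 32, 64)):
    """Estimate H from the slope of log Var(block means) versus log block size."""
    xs, ys = [], []
    for m in block_sizes:
        n_blocks = len(series) // m
        means = [sum(series[i * m:(i + 1) * m]) / m for i in range(n_blocks)]
        xs.append(math.log(m))
        ys.append(math.log(statistics.pvariance(means)))
    # Least-squares slope; for a self-similar process the slope is 2H - 2.
    x_bar, y_bar = statistics.fmean(xs), statistics.fmean(ys)
    slope = (
        sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
        / sum((x - x_bar) ** 2 for x in xs)
    )
    return 1.0 + slope / 2.0


# Uncorrelated noise should give an estimate near 0.5 (no long-range dependence).
rng = random.Random(0)
white_noise = [rng.gauss(0.0, 1.0) for _ in range(8192)]
print(hurst_aggregated_variance(white_noise))
```

A series with long-range dependence would show block-mean variances that decay more slowly with block size, pushing the estimate toward 1.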

These methods, particularly in relying on simulations or analytical reasoning and
not on actual Internet traffic traces, have been heavily criticized, especially
by those favoring an application layer explanation for self-similarity. In (60; 61), a
more comprehensive TCP model is developed accounting for both the backoff phase
and congestion avoidance phase after slow start. Their argument is that TCP can
generate correlation structure and possibly self-similarity but only on short
timescales (up to 1024 × RTT for high packet loss) and not on the arbitrarily long
time scales which characterize most self-similar traffic. Their models show that at
low packet loss rates the correlation structure is dominated by congestion avoidance
after slow start while the exponential backoff governs the correlation structure at
high loss rates. In (76), the authors focus on short TCP flows and thus only model
slow start and backoff. They again show self-similarity at large enough packet loss
rates, but although the long-range dependence that is also present would connote
infinite variance, the TCP based self-similarity only extends over certain short
timescales. Thus the authors dub this self-similarity pseudo self-similarity since
its timescale is relatively limited. Veres et al. answer these criticisms in (172).
Here they concede that TCP's congestion control may not by itself be the cause of
LRD in Internet traffic, but they show through data, simulation, and mathematical
arguments that TCP's congestion control suite may propagate self-similar traffic
along its path if it encounters a bottleneck that limits its send rate and has
self-similar traffic. Therefore even if TCP cannot create the full self-similar
effect, it may be responsible for propagating the self-similarity far beyond the
traffic where it originated.



In a variation of the above research, Sikdar and Vastola (153; 152) give a model
where self-similarity and long-range dependence emerge from the dynamics of a
single TCP flow instead of multiple flows. They model a single TCP flow as the
superposition of W_max ON/OFF processes where W_max is the maximum window
size advertised by the receiver. This is similar to the earlier ON/OFF model but
they show that for higher packet loss rates, a higher Hurst exponent and more
self-similar traffic is generated. In addition, there are other papers detailing
possible mechanisms by which TCP congestion control can give rise to self-similarity
in Internet traffic.

In the end, TCP congestion control has not emerged as a favored cause of
self-similar traffic. Perhaps it is a strong or even dominant factor of
self-similarity at relatively short timescales but it is likely not the cause of the
pervasive self-similar traffic described in most papers. One final note not
connected to the idea of self-similarity is that TCP is one of the few ways that
traffic actively couples to the network topology via the dependence of throughput
on RTT. Though most latency in networks is likely caused by congestion and other
network conditions, given similar bandwidths and delays, the average shortest path
(number of hops) in the topology can affect average RTT as shown in (56). In
summary, TCP, being the dominant protocol on the Internet, is one of the main
determinants of the traffic dynamics. However, TCP is a complicated and
feedback-driven protocol whose actions can only be partially estimated using
analytical or stochastic models. TCP holds considerable promise for those
looking for more intricate complex phenomena or pattern formation in Internet
traffic dynamics.

8. Theories of Phase Transitions and Critical Phenomena in 
Networks 

The most active area of work by physicists in the research of network dynamics is a 
group of research which merges the new insights of Internet traffic behavior with the 
mature and well-tested tools of statistical mechanics and critical phenomena. Similar 
to papers written on vehicle traffic (|89l : Ull : l30l ) these papers have analyzed the onset 



of congestion in networks as a phase transition from a free-flow to a congested state
determined by a critical parameter. In fact, an explicit comparison was given in
(70). The papers, in general, deal with three broad, though sometimes overlapping,
themes regarding the onset of congestion. First are the papers that analyze the
onset of congestion as a function of the packet creation rate for various topologies
and also whether the self-similar structure of traffic can be reproduced in these
models. Second are models primarily concerned with investigating the rise of
self-organized, emergent phenomena in networks in the critical state and linking the
studies of congestion with the study of self-organization in general. Finally, there
are many papers that investigate how different routing strategies can delay or affect
the onset of congestion. The papers of the last category often overlap with the first.
Below I will describe some of the most often quoted papers. A more comprehensive
list and reference of papers is given in table [S] for those wanting to delve into the
topic in more detail. In the papers described, the critical parameter is typically the
packet creation rate. This has different symbols depending on the paper but here
we will describe it as λ. Finally, in the next section I will note many common, and
unfortunately often accurate, criticisms of and problems with these models.

Papers by physicists investigating congestion first concentrated on the onset of
congestion as a critical phenomenon and possible links between this and the
self-similar nature of Internet traffic. With few exceptions, these papers focus on
the link or network layer dynamics (IP) as the source of critical phenomena in
Internet traffic. One of the first papers to deal with a phase transition model of
Internet traffic was by Csabai in 1994 (40). In this paper, Csabai noted the
presence of a 1/f power spectrum for the RTTs of pings between two computers where
the fitted slope is -1.15 (about an H of 1.08). He is also among the first to compare
Internet data traffic with vehicle traffic (70). It must be noted that the RTT from
ICMP echoes is not always equivalent to the RTT in TCP since many gateways give
preferential forwarding to TCP packets. Also, this power law spectrum based on
ICMP echoes is different from that of the overall traffic whose self-similarity was
discussed earlier.
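
The conversion from the fitted spectral slope to the quoted H can be checked with a one-line calculation. It assumes the commonly used relation for a power spectrum S(f) ∝ f^(-β), namely β = 2H - 1; the text does not state which relation was used, so this is an assumption that merely reproduces the quoted value:

\[
H = \frac{\beta + 1}{2} = \frac{1.15 + 1}{2} = 1.075 \approx 1.08
\]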



Takayasu, Takayasu, and Sato (165) followed up with a similar study where they
also note the 1/f distribution of RTT for ICMP pings between two computers if
there are many gateways on the route between them, likely because of consecutive
jamming due to filled buffers. For a short route, their echo replies are distributed 
1//^ at low frequencies and as white noise at higher frequencies (/ > lO^'*). They 
extend the analysis, though, to include a theoretical derivation of the behavior of
network traffic taking into account a simple topology. They disregard loops and use
the theoretical topology of a Cayley tree where gateways are sites and cables are
links. A contact process (CP) is modeled where empty sites are considered jammed
gateways and filled sites (particles) are considered un-jammed gateways. A jammed
gateway has a probability p of becoming un-jammed if it is adjacent to an
un-jammed gateway (particle reproduction) and an un-jammed gateway will become
jammed with an independent probability q (particle annihilation). An un-jammed
gateway will do neither with probability 1 - p - q. In analyzing the simulation, they
assume that the number of un-jammed gateways over time is equivalent to the
distribution of RTT. They derive a power-law result from the CP which shows the
distribution of jammed sites over time follows a t^(-α) power law distribution with
time when a parameter δ = 1 - p/q equals 0 and that this power law yields 1/f
noise for the conditions 0 < α < 1. A comparison of ICMP echo RTTs to earthquake
aftershocks is made by Abe and Suzuki (1) who fit the RTT from pings in Internet
traffic to a statistical distribution that is similar to Omori's Law which models the
arrival of aftershocks from an earthquake.

A similar hierarchical tree topology is used to investigate critical behavior for
data flow by Arenas, Díaz-Guilera, and Guimerà (9). They derive a mean-field
theory solution for the critical packet creation density and also show that most
congestion occurs at the root of the tree and the first level of branching. Power-law
scaling of the total number of packets in the system is observed near the critical
point λ_c.

Takayasu & Takayasu later expand on a theory of self-similarity in Internet
traffic as a critical phenomenon in (166; 164; 163). In (164) Takayasu, Takayasu, and
Fukuda describe what they believe is a phase transition in the flow of overall Internet
traffic. They separate data traffic into 500 s bins and take autocorrelations of each
bin, comparing the correlation length in seconds with the mean traffic density. The
correlation length increases with traffic flow density until a critical density λ_c = 500
kbytes/sec where the correlation length begins decreasing again. They associate this
with a second order phase transition in the flow where there is a transition from free
to congested flow. When they consider any flow above 300 kbytes/sec as "congested"
around the critical point they can show power law scaling of the lengths of congestion
times, confirming the critical nature of the phenomenon. In (163), the same authors
theorize that the critical nature of traffic measured in Ethernet networks is due to
the Ethernet collision detection management algorithm (CSMA/CD), which employs
an exponential backoff algorithm on detection of an Ethernet frame collision that is
qualitatively similar to the congestion backoff mechanism described in TCP. They
show that a binary backoff algorithm can generate 1/f traffic distributions at the
critical point.
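
The binary exponential backoff mentioned above can be sketched in a few lines. The sketch follows the classic truncated form used by Ethernet, in which the random wait after the n-th consecutive collision is drawn uniformly from 0 to 2^min(n, 10) - 1 slot times; the function names are hypothetical, and this is not the traffic model of (163).

```python
import random


def csma_cd_backoff_slots(collision_count, rng=random):
    """Slots to wait after the n-th consecutive collision of a frame.

    Truncated binary exponential backoff: pick k uniformly from
    0 .. 2**min(n, 10) - 1.
    """
    if collision_count < 1:
        raise ValueError("collision_count must be >= 1")
    exponent = min(collision_count, 10)
    return rng.randrange(2 ** exponent)


def mean_backoff_slots(collision_count):
    """Expected wait in slots: (2**min(n, 10) - 1) / 2."""
    return (2 ** min(collision_count, 10) - 1) / 2.0


for n in (1, 2, 3, 10):
    print(n, csma_cd_backoff_slots(n), mean_backoff_slots(n))
```

The mean wait roughly doubles with each further collision, which is the feature that makes the mechanism qualitatively similar to TCP's timeout backoff.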

Most of the other prominent papers in the first category follow in the tradition 
of the first Takayasu paper describing phase transitions using modeled networks 
in simulations to infer a shift in dynamics above a certain packet creation/flow
threshold IH; IHi; \^ [l2^; ^ ^5Q, [Tsl; [19^ El)- The threshold can be changed 
either by topologies, which are more efficient with small world networks such as 
random graphs or scale-free graphs than lattices or Cayley trees, by changing the 
proportion of nodes that can generate (versus only route) traffic, or by routing 
strategies such as in (|l29l : llQq ). In (JGSl : llSa llSSi ). mean field methods are used to 
calculate the onset of congestion for a 2D lattice at 



\[
\lambda_c = \frac{2}{pL} \qquad (31)
\]



where p is the proportion of nodes that can generate packets and L is the length
in nodes of a side of the lattice. All of these papers purport to show 1/f distributions
of packet travel times which they link to self-similar traffic. A paper looking at the
problem from a different angle by Moreno, Pastor-Satorras, Vázquez, and Vespignani
(120) approached the problem by looking at what average traffic density in the
overall network could lead to a spread of congestion across all nodes and the collapse
of the network. This is a related viewpoint on the cascading router failures and
percolation models that have been studied on scale- free topologies (123J; IllQt l39f ) 
which links cascading failures not just to topological sensitivity of certain hubs but 
also the traffic levels in the network. 
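
The 1/(pL) scaling of the mean-field threshold in equation (31) can be motivated by a simple capacity-balance argument. The following is a heuristic sketch under assumptions not spelled out above (an L × L lattice with periodic boundaries, shortest-path routing with a mean path length of about L/2 hops, and each node forwarding at most one packet per time step); it is not claimed to be the derivation used in the cited papers.

\[
\underbrace{p L^2 \lambda}_{\text{packets created per step}} \times \underbrace{\frac{L}{2}}_{\text{hops per packet}} \;\le\; \underbrace{L^2}_{\text{hops served per step}}
\quad \Longrightarrow \quad
\lambda \le \lambda_c \approx \frac{2}{pL}
\]

Under these assumptions, larger lattices or a larger share of packet-generating nodes lower the creation rate at which queues start to grow without bound.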

In the second category are papers largely concerned not with the value of the
critical parameter but with emergent phenomena themselves. One of the earliest
papers hinting at this was a study by Barthelemy, Gondran, and Guichard (15).
Borrowing techniques from nuclear physics, they studied the eigenvalue distribution 
and eigenvectors of the traffic correlation matrix of 26 routers and 650 connections 
in the Renater computer network for two weeks of traffic data. Their technique 
used random matrix theory to compare the eigenvalue distribution of the correlation 
matrix of Renater traffic fluctuations to that of a control random matrix. The traffic 
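fluctuations in an interval τ in the traffic between source i and destination j were
defined as

\[
g_{ij}(t) = \log \frac{F_{ij}(t + \tau)}{F_{ij}(t)} \qquad (32)
\]

and the correlation between connections ij and kl is defined as

\[
C_{(ij)(kl)} = \frac{\langle g_{ij} g_{kl} \rangle - \langle g_{ij} \rangle \langle g_{kl} \rangle}{\sigma_{ij} \sigma_{kl}} \qquad (33)
\]

where σ_ij is the standard deviation of g_ij.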

They found that the largest eigenvalues were much larger than the largest
eigenvalues of a similar rank random traffic matrix whose flows have a mean of 0 and
unit variance. Also, the largest components of the eigenvector for the largest
eigenvalue correspond to the most highly correlated routers. These results all
indicated spatiotemporal correlations among the routers in the network that deviated
from traffic defined purely by a stochastic process.
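
The fluctuation and correlation definitions above can be illustrated with a small sketch for a single pair of connections. It assumes strictly positive flow values, a lag of one sample for τ, and normalization of the covariance by the two standard deviations; the function names and the toy flow values are hypothetical.

```python
import math
import statistics


def log_fluctuations(flow, lag=1):
    """g(t) = log(F(t + lag) / F(t)) for a strictly positive flow series."""
    return [math.log(flow[t + lag] / flow[t]) for t in range(len(flow) - lag)]


def fluctuation_correlation(flow_a, flow_b, lag=1):
    """Normalized correlation between the fluctuations of two connections."""
    g_a = log_fluctuations(flow_a, lag)
    g_b = log_fluctuations(flow_b, lag)
    mean_a, mean_b = statistics.fmean(g_a), statistics.fmean(g_b)
    products = [x * y for x, y in zip(g_a, g_b)]
    covariance = statistics.fmean(products) - mean_a * mean_b
    return covariance / (statistics.pstdev(g_a) * statistics.pstdev(g_b))


# Toy data: flow_b is a scaled copy of flow_a, so their fluctuations coincide.
flow_a = [100.0, 120.0, 90.0, 150.0, 110.0]
flow_b = [2.0 * x for x in flow_a]
print(fluctuation_correlation(flow_a, flow_b))
```

Because the log ratio removes any constant scale factor, the two toy connections are perfectly correlated even though their absolute traffic volumes differ. Filling a matrix with such values for all connection pairs gives the correlation matrix whose eigenvalues are then compared with a random control.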

Among the most consistent researchers to address the emergent phenomena
question directly are Yuan and Mills (188; 189; 190; 192), who make a persuasive
case that emergent phenomena in networks could go beyond the simple onset of
congestion in simple network topologies where packets are treated only at the network
(IP) layer. The main theme of their papers is measuring spatiotemporal patterns
that emerge in larger networks. The main features they add that are lacking in many
other models are size (more nodes), more realistic topologies, as in (190), where
their network includes four levels of hierarchy in a tree structure, and modeling of
transport (TCP) level effects such as congestion control (190). In their first paper
(188), they use a simplified topology of a 2D cellular automaton (CA) with all nodes
as hosts and routers. The state of a router on the CA is defined by the number of
packets in its queue and it "transitions" by passing off packets given the state of
the queues of the surrounding cells. The traffic sources can originate in any node
and are modeled as ON/OFF sources as in (179; 180). Packets are routed via a full
routing table.



In addition, they model the systems with three types of congestion control
algorithms: no congestion control, a congestion control that stops transmitting above
a threshold RTT per hop to the destination, and a TCP-imitating congestion
control that includes slow start and congestion avoidance. Their main results use the
TCP-imitating congestion control and produce power spectra of the time series of the
number of received packets at a given node for various sample time lengths and
network sizes. In general, they find that increasingly longer sample times diminish
the correlations and long-range dependence measured in the power spectra but
increasing the network size increases the correlations over both time and space.
Comparing smaller networks to similar sized subgraphs in larger networks shows
that the subgraphs exhibit stronger correlations and they deduce that large network
sizes can allow for wider coupling and self-organization. They also surmise larger
networks may be more predictable because congestion is stable over longer time
scales.

In subsequent papers, this idea is developed further. In (189), the authors do
an analysis by creating a weight vector for each node that is constructed from the
components of eigenvectors derived from the correlation matrix as in (15). Yuan
and Mills create a technique to analyze simulated networks of a larger size. They
define flow vectors x_i, where i ranges from 1 to N and N is the number of nodes,
with N components each, the component x_ij representing the flow from node i to node j
during a sample interval. They then create a normalized flow vector by normalizing
each element of each vector over the entire sample time including all intervals, where
the normalized vector is



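\[
f_{ij} = \frac{x_{ij} - \langle x_{ij} \rangle}{\sigma_{ij}} \qquad (34)
\]

where ⟨x_ij⟩ and σ_ij denote the mean and standard deviation of x_ij over all sample intervals.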

They analyze the eigenvector of the largest eigenvalue of the correlation matrix
among all normalized flow pairs over time and use the elements of the subvectors
of this eigenvector to create N vectors S_i, where S_ij is the relative contribution of
node i to node j in terms of traffic correlations. Simulating traffic on a
2D topology with ON/OFF sources with Pareto distributed ON times and TCP
congestion control, they observe complex fluctuation of the largest eigenvalue over
time as well as correlated traffic between certain nodes over time, though they
note their largest eigenvalues tend to be smaller than those in (15). They also raise
the point that during congested critical states, taking a sample of a few nodes (or
routers) may give a better and more cohesive overall picture of the entire network if
sampling all nodes is infeasible. This is mainly due to the increased spatiotemporal
correlations in congestion. Also, longer sampling time windows tend to reduce the
visibility of correlations in traffic.

In (190), they continue the same research based on the eigenvalue method but
using a four-tier (backbone router, subnet router, leaf router, and source hosts)
hierarchical network to model the actual AS-level and below topology of the Internet.
They also use the measures of spatiotemporal correlation both to find hotspots and to
show that distributed denial of service (DDoS) attacks can cause large-scale effects
beyond the target router by disturbing traffic flows in other correlated routers in the
network. They suggest methods of analyzing network-wide phenomena using small
samples of nodes and possibly detecting DDoS attacks by the signatures of
large-scale perturbations in correlated network traffic. In (192), they return to the
2D CA formalism but investigate spatiotemporal dynamics using wavelets and logscale
diagrams over varying average packet creation rates, congestion control protocols, and
average flow durations. They looked for causes of LRD at the application level (file
size distribution), transport level (congestion control type), and network level
(varying the rate of ON/OFF sources and network size). They found that LRD emerged
on wide time scales with long-tail distributions of file sizes, an increasingly large
network size, or Pareto distributed ON/OFF source times but only emerged on limited
time scales when only the type of congestion control was varied. Though they
acknowledge the limits of their model they suggest that most LRD emerges due
to interactions in the network layer or possibly file sizes in the application layer.
Like (60; 61; 76) they suggest congestion control plays only a limited part in the
emergence of LRD. Yuan et al. (195) closely replicate the results of (189) except
they compare visualization of the largest eigenvalue over time with the information
entropy of the weight vectors. They find the eigenvalue more clearly shows the
change in correlation structure over time. There are also some very interesting
spatiotemporal plots of router congestion over time in (193) showing pattern formation
in the temporal congestion among routers in a 1D cellular automaton model.

Following the theory that the onset of congestion could be considered a critical
phenomenon, investigations began on possible new routing strategies that could
help extend the tolerance of a network to congestion. In short, all of the proposed
routing strategies aim to be an improvement over current Internet routing,
where routers use a global routing table and shortest path metrics to route packets.
In particular, these papers show that the geodesics on the network between
two points defined solely according to a graph shortest path are not always the
best routing paths in real traffic conditions. The newly proposed routing strategies
tend to explicitly take into account traffic and/or queue conditions at neighboring
routers or use different topological measures such as betweenness (48; 147) in order
to redefine the shortest path metrics and packet routing strategies. Besides random
walk routing, the most common routing heuristics are next nearest neighbor (NNN)
dj; II6II : Il62t Il59t II60I ) based on delivery to the destination if it is an adjacent node 
or either random walking or bias to higher degree nodes otherwise (except a packet 
cannot travel to a node it has just left). Typically it is shown that NNN is superior
to random routing. Preferential next nearest neighbor (PNNN) ([183; Il76l : Isil : Isa ) 
slightly alters NNN by explicity taking into account node degree in routing accord- 
ing to a parameter a which creates a preference distribution for nodes of a degree 
k according to the relation fc", or a similar relation. In (j28l ) traffic congestion at 
nodes is also taken explicitly into account. Typically, there is a purported value of a 
which minimizes average packet travel time and tolerates a higher packet creation 
rate for the onset of congestion. 
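
As an illustration of the PNNN rule described above, the following is a minimal Python
sketch of a single next-hop decision. It is not taken from any of the cited papers: the
function name `pnnn_next_hop`, the adjacency-dictionary representation, and the fallback
that lets a packet turn back at a dead end are all assumptions made here. The exponent
`alpha` plays the role of the degree-preference parameter, and `alpha = 0` reduces the
rule to plain NNN with an unbiased random walk.

```python
import random


def pnnn_next_hop(graph, current, previous, destination, alpha, rng=random):
    """Choose the next hop for one packet under a PNNN-style rule.

    graph: dict mapping each node to a list of its neighbours (undirected).
    previous: node the packet has just left, or None for a new packet.
    alpha: preference exponent applied to the degree of each candidate.
    """
    neighbours = graph[current]
    # Deliver directly when the destination is an adjacent node.
    if destination in neighbours:
        return destination
    # A packet may not return to the node it has just left. Assumption of
    # this sketch: at a dead end the packet is allowed to turn back.
    candidates = [n for n in neighbours if n != previous] or list(neighbours)
    # Weight each candidate by (degree)^alpha and draw one at random.
    weights = [len(graph[n]) ** alpha for n in candidates]
    return rng.choices(candidates, weights=weights, k=1)[0]


# Hypothetical six-node network used only to exercise the function.
toy_graph = {
    "a": ["b", "c", "d"],
    "b": ["a"],
    "c": ["a", "e", "f"],
    "d": ["a"],
    "e": ["c"],
    "f": ["c"],
}
print(pnnn_next_hop(toy_graph, "a", "b", "e", alpha=1.0))
```

With a large positive `alpha` the packet at node "a" is steered almost surely to the
hub "c", while a negative `alpha` would favor the low-degree node "d" instead.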



Many of these papers, with some exceptions such as (68), do not show how these new
routing methodologies compare against the shortest-path, full-routing-table approach
the Internet currently uses. What value would a random walk, or a walk based on the
degree of an adjacent router, add to network routing infrastructure and performance?
Or is it worse than the current system (which most comparisons suggest)? The basic
idea that the geodesic between source and destination depends on traffic conditions is
a very interesting proposal, but how could it be implemented in practice? These
questions should provide fertile ground for future research and cross-disciplinary
collaboration.

There have been some papers proposing feedback-based routing in the network
engineering literature (for example (197)); however, they are not related to similar
research in physics, and this line of research by physicists is often not looked upon
highly by the network engineering community, as will be discussed in the next section.
One final note from equation [T] is that, since all of these models use the packet as
the basic unit, the relationship between data throughput and packet size shows that,
apart from topology changes or new routing, one easy way to reduce the packet creation
rate on a network is to increase the average packet size, as stated earlier. Since
throughput is an important measure of the function of the Internet, future
measurements and experiments on packet creation and congestion should thoroughly
account for this.

9. Criticisms of Various Approaches to Self-Similarity

Though the physics literature on congestion and critical phenomena in networks is
becoming increasingly sophisticated and adept at "reproducing" self-similar patterns
seen in Internet traffic, there have been several valid criticisms, particularly from
the network engineering community, that the methodologies may reproduce observations
but do not take into account the actual workings of the Internet in detail
^ [isi; 0; [loi ^ [l4l|; i; [loi 0). In ^), Floyd and Paxson, though not addressing
physics approaches directly, note that while simulations are crucial to Internet
research, the Internet is extraordinarily complicated and difficult to simulate
accurately, especially on large scales. In particular, they point out three problems:
the increasingly unpredictable behavior of IP over increasingly diverse networks and
applications, the massive and continuously increasing size of the Internet, and its
penchant for changing in many drastic ways over time. Heterogeneity is the rule, not
the exception, and many activities such as periodicities are often left out of
simulations. The papers (182; 6; 45) address the physics community more directly,
pointing out defects in the theories of critical phenomena and in the hub vulnerability
of scale-free networks, respectively. In (182), Willinger et al. distinguish evocative
models, which reproduce the observations using generic models, from explanatory models,
whose applicability is tested by experiment and measurement (what they term "closing
the loop"). They complain that many models, from physicists and some engineers, are
evocative and ignore the research on the particulars of Internet protocols, function,
and traffic that could verify or refute the model. These models instead act as a "black
box" that tries to offer an appealing model or theory to match self-similarity but
provides no prediction or verification using real-world traffic trace data. For
example, the largest problem in many of the phase transition models is that they
demonstrate self-similar traffic only at critical loads, while Internet traffic
measurements show self-similar traffic in both free-flow and congested regimes and at
all levels of flow. In (6), Alderson and Willinger further elaborate that they believe
models from statistical mechanics are not applicable to the Internet, which is designed
according to multiple economic and performance considerations and not by the simple
rules of self-organization often present in models such as preferential attachment.

The Barabási-Albert preferential attachment model for the growth of scale-free
networks is criticized since many of the highly connected nodes in the Internet are at
the edges near final consumers rather than in the central parts of the high-speed AS
network. This valid criticism, though, is partially answered by disassortative mixing,
which shows that in non-social networks like the Internet, high-degree nodes are more
likely to connect with low-degree nodes, and which rules out a core of highly
connected, highly vulnerable hubs. Many subsequent models of scale-free networks have
taken this into account. Finally, (45) criticizes the research in topology that says
scale-free networks are vulnerable to attack due to highly connected hubs, which they
once again say is fallacious because, despite power-law degree distributions, the most
highly connected hubs are often on the periphery of the Internet and not along its
crucial backbone. Lee in (103) points out the aforementioned problems with the critical
phenomena models, but also notes that the ON/OFF model has problems of its own: since
an ON/OFF source has a long-tailed duration distribution, there is a finite probability
of an ON or OFF period as long as any observation period. Lee also criticizes TCP
models for not accounting for similar effects in UDP and other stateless protocols. In
(|7tt; 60; 61; 172), the possibility is also raised that TCP can affect the traffic
dynamics in a more market dynamic on shorter time scales (near the order of multiple
RTT); this is an area of future investigation and debate.
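
Lee's point about long-tailed ON/OFF durations can be made concrete with a small Python
sketch that compares the chance of a single period outlasting the whole observation
window under a heavy-tailed and a light-tailed law. The Pareto form, the tail index of
1.2, the one-second minimum, and the one-hour window are illustrative assumptions made
here, not values from the cited work.

```python
import math


def pareto_tail(t, t_min, shape):
    """P(duration > t) for a Pareto law with minimum t_min and tail index shape."""
    return 1.0 if t <= t_min else (t_min / t) ** shape


def exponential_tail(t, mean):
    """P(duration > t) for an exponential law with the given mean."""
    return math.exp(-t / mean)


# Hypothetical numbers: periods of at least 1 s, tail index 1.2,
# and a one-hour observation window.
window = 3600.0
t_min, shape = 1.0, 1.2
p_heavy = pareto_tail(window, t_min, shape)

# Exponential law matched to the Pareto mean, shape * t_min / (shape - 1).
matched_mean = shape * t_min / (shape - 1.0)
p_light = exponential_tail(window, matched_mean)

print(f"Pareto: {p_heavy:.2e}  exponential: {p_light:.2e}")
```

Under these assumptions the heavy-tailed probability is small but far from negligible
once many sources are aggregated, whereas the matched exponential probability is
vanishingly small, which is why a finite trace cannot rule out periods longer than the
trace itself.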

In the author's opinion, this criticism of critical phenomena models is extremely
valid, and the restriction of self-similarity to critical loads and the lack of
real-world validation show that many of the models in the previous section are
unrealistic or misleading. Given the current facts and validation against real data,
the weight of the evidence for long-timescale self-similarity lies with the application
layer explanation based on ON/OFF sources. However, this does not mean the physics
models based on statistical mechanics are completely useless. Wide-scale Internet
traffic measurements are currently nearly impossible, and large-scale theories of
traffic are still relatively undeveloped. Currently almost all real data consists of
traffic traces over one given link during a given time period, which makes large-scale
Internet studies very difficult. The study of large-scale congestion, traffic
correlations, and other complicated phenomena will probably draw useful lessons from
physics models of self-organization and long-range correlations, though more realistic
models are absolutely necessary. The omission of TCP-like congestion control, except in
a few models, must be rectified. If these criticisms sound a bit harsh, try to put
yourself in the shoes of most network engineers, who understand in detail the intricate
processes by which the Internet operates. When shown a model with a 2D grid, no mention
of congestion control, infinite router buffers, and self-similar traffic only in
congested conditions, their incredulity is understandable. It is aggravated by the fact
that almost none of these papers try to match results with, or analyze, real traffic
traces.

In defense of the efforts of physicists, however, I believe that physics started out
correctly by choosing simple topologies and dynamics scenarios that are both
analytically tractable and amenable to rapid simulation. However, though the earlier
work of those like Takayasu began by looking at traffic traces, this soon disappeared
almost completely in favor of computer simulation. Despite the obvious shortcomings in
explaining self-similar traffic, the demonstration of large-scale congestion, though on
which timescale is unknown, may be a result the physics and engineering communities
should take note of for validation or refutation.

With the work in (151; [iSQJ; 190) showing large-scale correlations among router
traffic in both real and simulated data, can we really look at Internet dynamics solely
from the viewpoint of a collection of single traffic traces? The question is not
whether the Internet displays the large-scale correlations and self-organization well
known to complexity theory, but how these large-scale effects play out and whether
simulations with both realistic dynamics and topology can predict effects that we have
not yet observed or known how to look for. Much more cross-disciplinary work is needed
in this direction.

Though the analysis of the criticisms above may seem to try to be even-handed and
please everyone while solving nothing, the nature of the problem is such that the
issues regarding the core nature of Internet traffic cannot be easily resolved.
Willinger et al. are right that the loop must be closed, and that merely creating a
simulation that outputs traffic with a Hurst exponent near 0.8 cannot be considered the
final word on the "cause" of self-similarity in Internet traffic. In addition, though
difficult and nearly impossible, large-scale coordinated traces and models of a
topologically and dynamically correct Internet are the next logical step in modeling
and studying these phenomena.

10. Other Interesting Phenomena 

10.1. Flows and fluctuations 

Barabási and Argollo de Menezes from the physics community (10) reported an
interesting result when they announced a relationship between the average volume of a
flow and its dispersion (the standard deviation of traffic volume) among the nodes in a
network. In particular, they found that, taken over all nodes in a network, the average
flux $\langle f \rangle$ and the standard deviation $\sigma$ per node are related by the
scaling relationship

$$\sigma \sim \langle f \rangle^{\alpha} \qquad (35)$$

where $\alpha$ is near either 1 or 1/2 for two types of systems. The traffic on nodes
of a network of Internet routers and the on/off state occurrence of junctions in a
microprocessor electronic network had scaling exponents of 1/2, while visitor traffic
to a group of WWW pages, traffic at a group of highway junctions, and water flow at
different locations in a river network demonstrated scaling exponents of 1. In two
simulations, one based on random walks on a scale-free network and the other simulating
shortest-path traffic on a scale-free network, they were able to explain the scaling
exponent of 1/2 as being based on the channeling of traffic through select nodes and
arising from internal, or endogenous, network dynamics. The scaling exponent of 1, on
the other hand, is shown to be universal when the amount of traffic is driven by
external forces as well as endogenous dynamics, similar to an open system. They believe
the exponent of 1 is more universal since it results from the interplay of endogenous
and exogenous pressures. In a subsequent paper (11), they give a method of extracting
the endogenous and exogenous traffic and propose a metric, $\eta_i$, that defines the
predominance of external or internal influences on traffic dynamics by the equation

$$\eta_i = \frac{\sigma_i^{\mathrm{ext}}}{\sigma_i^{\mathrm{int}}} \qquad (36)$$

where $\sigma_i^{\mathrm{ext}}$ and $\sigma_i^{\mathrm{int}}$ are the standard
deviations of the external and internal contributions to the traffic at node $i$. A
value $\eta_i \gg 1$ indicates an externally driven system, while $\eta_i \ll 1$
indicates a system dominated by internal dynamics. $\eta_i$ can vary on different time
scales, as (49) showed using trading records from the New York Stock Exchange, where
internal dynamics were dominant on the scale of minutes while external ones were
dominant on the time scales of hours and days. In (162), a power-law scaling
relationship was also found via an NNN routing simulation on a scale-free network, with
the scaling exhibiting an exponent of either 1/2 or 1.
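
To make the flux-fluctuation scaling concrete, the sketch below estimates the exponent
as the least-squares slope of log standard deviation against log mean flux across
nodes. The data are synthetic and purely illustrative: independent Poisson arrivals
stand in for internally driven traffic, and a single shared demand factor stands in for
externally driven traffic. Neither is the simulation used in the cited papers, and the
names `scaling_exponent` and `poisson`, the node means, and the number of windows are
assumptions made here.

```python
import math
import random
import statistics


def scaling_exponent(series_by_node):
    """Least-squares slope of log(sigma) against log(mean flux) across nodes."""
    xs = [math.log(statistics.fmean(s)) for s in series_by_node]
    ys = [math.log(statistics.pstdev(s)) for s in series_by_node]
    x_bar, y_bar = statistics.fmean(xs), statistics.fmean(ys)
    num = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    den = sum((x - x_bar) ** 2 for x in xs)
    return num / den


def poisson(rng, lam):
    """Knuth's Poisson sampler; adequate for the small means used here."""
    limit, k, p = math.exp(-lam), 0, 1.0
    while True:
        p *= rng.random()
        if p <= limit:
            return k
        k += 1


rng = random.Random(1)
means = [1, 2, 4, 8, 16, 32, 64]  # hypothetical mean packets per time window
windows = 2000

# Internally driven: independent Poisson arrivals at every node.
internal = [[poisson(rng, m) for _ in range(windows)] for m in means]

# Externally driven: one shared demand factor scales all nodes together.
demand = [rng.uniform(0.5, 1.5) for _ in range(windows)]
external = [[m * g for g in demand] for m in means]

alpha_internal = scaling_exponent(internal)
alpha_external = scaling_exponent(external)
print(f"internal: {alpha_internal:.2f}  external: {alpha_external:.2f}")
```

In this toy setting the first estimate sits near 1/2 and the second at 1. Mixing the
two mechanisms or changing the window length moves the fitted slope between these
values, which is the kind of dependence the criticisms discussed next are concerned
with.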

The generality of the results and the universality of the classes proposed in these
papers has recently been disputed, though. Duch and Arenas (46) perform several
measurements relating flow and fluctuations on data from the Abilene Internet backbone
and claim that $\alpha$ varies between 0.71 and 0.86 and is not 1/2 as represented in
the first papers. They also propose that this deviation is due to the original papers
disregarding congestion in networks, and show analytically that for short timescales of
measurement an $\alpha$ of 1/2 is a trivial result, but a false generality once the
timescales are extended and other parameters come into play. They conclude that there
is a scaling relationship but no universality classes as claimed. Meloni et al. (114)
go even further and say that under certain conditions, power-law scaling between flows
and fluctuations should be abandoned. They simulate a random diffusion process of
packets on a network, measuring scaling as influenced by the time window of
measurements, the degrees of the nodes on which flows and fluctuations are measured,
and the volume of packets in the network. They produce an analytical result that
explains power-law scaling between the two quantities only under conditions where the
noise fluctuations in the system and/or the time window size are relatively small.
Otherwise $\alpha$ tends towards 1 and does not display power-law scaling. Also, they
show that even in networks with power-law scaling, $\alpha$ can differ, at 1/2 for
low-degree nodes and 1 for high-degree nodes, showing that within a network the scaling
may vary with the degrees of the nodes.



Finally, Han et al. in (78) measure $\alpha$ for the download rates of an econophysics
web database and find an $\alpha$ varying from 0.6 to 0.89 depending on the length of
the sample time window. They confirm the power-law scaling between flux and
fluctuations but do not find any universal exponents. These results suggest that
research on the relationship between flows and fluctuations in networks is still in its
earliest stages but holds out much promise for future progress.



10.2. Internet worm traffic & BGP storms 



A simulation by Yuan and Mills (190), touched on earlier, aimed to predict part of the
large-scale impact of a rapidly spreading Internet worm. Recent increases in the amount
and sophistication of malicious code released on the Internet, including the use of
"zombie" computers for large-scale DDoS attacks, have demonstrated that this is far
from just a theoretical exercise. An increasingly large literature on the Internet
traffic effects of epidemics has arisen, particularly after the Code Red outbreak in
2001 (which the author had the dubious honor of handling as a network security
administrator at the time). Again, to stay within the scope of the paper, the aspects
of Internet worms discussed here will be tightly limited to effects on traffic, both
measured and predicted, and will not delve into the voluminous theoretical work on
epidemiology on scale-free networks or other topologies (138) or much of the new
literature with specialized epidemic models for computer worms.

The three most studied Internet worms have been Code Red (start: July 19, 2001), Nimda
(start: September 19, 2001), and SQL Slammer (start: January 25, 2003). Slammer, though
not carrying a malicious payload, was the fastest spreading worm in history (jllSl).
What has often been found is that worms not only cause trouble for the computers they
infect; they also create large-scale traffic patterns that can disrupt the normal
behavior of entire networks. Often, a worm spreads by exploiting a vulnerability in
computers and tries to infect other computers by probing IP addresses chosen at random
or according to certain rules. With potentially millions of computers sending out such
probes at once, it is easy to see how normal traffic patterns can be seriously
disrupted.
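
How quickly random probing turns into a network-wide traffic load can be sketched with
a simple deterministic model. The model below is a generic random-scanning
approximation, not one of the specialized epidemic models from the literature, and the
function name `random_scan_outbreak`, the number of vulnerable hosts, and the probe
rate are hypothetical values chosen only for illustration.

```python
import math


def random_scan_outbreak(vulnerable, scans_per_step, steps,
                         address_space=2 ** 32, initially_infected=1):
    """Expected infected hosts and probe volume per step for a random-scanning worm.

    Every infected host sends scans_per_step probes per time step to addresses
    drawn uniformly from address_space; any vulnerable host that is probed
    becomes infected. Returns a list of (infected, probes_sent) tuples.
    """
    infected = float(initially_infected)
    history = []
    for _ in range(steps):
        probes = infected * scans_per_step
        # Chance that a given susceptible host receives at least one probe:
        # 1 - (1 - 1/address_space) ** probes, computed in a stable form.
        p_hit = -math.expm1(probes * math.log1p(-1.0 / address_space))
        infected += (vulnerable - infected) * p_hit
        history.append((infected, probes))
    return history


# Hypothetical parameters: 75,000 vulnerable hosts, 4,000 probes per second each.
history = random_scan_outbreak(vulnerable=75_000, scans_per_step=4_000, steps=600)
for second in (60, 120, 180, 240):
    infected, probes = history[second - 1]
    print(f"t={second:3d}s infected={infected:9.0f} probes/s={probes:.2e}")
```

The probe volume grows in proportion to the infected population, so the aggregate
scanning traffic follows the same steep logistic curve as the infection itself.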

In particular, worms have often been the culprit behind what could be termed a
large-scale instability in the BGP routing system, called a BGP update storm or BGP
storm. In a BGP storm, the normal level of BGP updates sent to update the routing table
can rise by several orders of magnitude and sometimes disrupt traffic (37; 148). For
example, (37) describes how during the Nimda worm, normal BGP update traffic of 400
advertisements per minute jumped to 10,000 advertisements per minute. This is not
because the worms infected the routers themselves but because the worms caused large
packet flows which overwhelmed router memory and CPU limits and caused the routers to
crash. These router crashes caused frenzied reorientation of the Internet routing
topology. BGP storms are interesting in that both traffic and topology are rapidly
changing. They may be an avenue for both physicists and engineers to investigate the
relationship between topology and traffic in a situation where both are largely in
flux.



Yuan and Mills expanded their work from (190) into a full paper (191) that looks at
spatiotemporal correlations between routers and hosts in several types of large-scale
DDoS attacks. They find that DDoS attacks may cause traffic variations at correlated
routers and hosts besides just the target. Because of these large traffic-altering
phenomena, certain spectral techniques have been researched to identify DDoS attacks.
Some of these are summarized in the next section.



10.3. Traffic oscillations/periodicities

Earlier, periodic behavior in Internet traffic was mentioned in passing as a
theoretical assumption about TCP traffic. Also, one of the consequences of the
self-similar nature of traffic is its $1/f$ spectral behavior. Beyond these, however,
there is a plethora of traffic periodicities that represent oscillations in traffic
over periods spanning several orders of magnitude, from milliseconds to weeks. Many of
these are well-defined and classified. They have two possible sources: first, software-
or transmission-driven periodicities, which range over time scales of milliseconds,
seconds, or in rare cases, hours; second, user-driven periodicities, which range over
time scales of days, weeks, and possibly longer. This new area of research has been
dubbed network spectroscopy or Internet spectroscopy and is finding uses in
applications ranging from identifying traffic sources via traffic periodicity
"fingerprints" to early detection of denial of service or other hacker attacks by
detecting anomalous oscillations in the traffic spectrum, similar to vibration analysis
of faulty machinery. The causes and periods of various known periodicities are
summarized in Figure 5. The values can deviate somewhat, so the periods are not always
exact, but they are a good guide to the major periodicities. User-driven periodicities
were the first known and are the most easily recognized. The first discovered and most
well-known periodicity is the 24-hour diurnal cycle and its companion cycle of 12
hours. These cycles have been known for decades; they were reported as early as 1980
and again in 1991, as well as in many subsequent studies (149; 97; 133; 147; 124; 130;
65). This obviously reflects the 24-hour work day and its 12-hour second harmonic, as
well as activity from around the globe. The other major periodicity from human behavior
is the week, with a period of 7 days (133; 147; 23), a second harmonic at 3.5 days, and
a barely perceptible third harmonic at 2.3 days. There are reports as well of seasonal
variations in traffic over months (79), but mostly these have not been firmly
characterized. Long-period oscillations have been linked to possible causes of
congestion and other network behavior related to network monitoring (124; 130). One
note is that user-driven periodicities tend to appear in protocols that are directly
used by most end users. The periodicities appear in TCP/IP, not UDP/IP, traffic and are
mainly attributable to activity with the HTTP and SMTP protocols. They also often do
not appear in networks with low traffic or research aims, such as the now defunct 6Bone
IPv6 test network.
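
The user-driven periodicities described above are the kind of feature a power spectrum
makes visible. The following Python sketch builds four weeks of synthetic hourly
traffic containing a diurnal cycle, its 12-hour harmonic, and a weekly cycle, and then
recovers the three periods with a naive discrete Fourier transform. The amplitudes,
noise level, and the helper name `power_spectrum` are assumptions made for
illustration; real traces are far less clean, and a production analysis would normally
use an FFT library and windowing.

```python
import cmath
import math
import random


def power_spectrum(samples):
    """Naive DFT power at frequency indices 1..N/2, with the mean removed."""
    n = len(samples)
    mean = sum(samples) / n
    centred = [x - mean for x in samples]
    powers = {}
    for k in range(1, n // 2 + 1):
        coeff = sum(x * cmath.exp(-2j * math.pi * k * t / n)
                    for t, x in enumerate(centred))
        powers[k] = abs(coeff) ** 2
    return powers


rng = random.Random(7)
hours = 24 * 28  # four weeks of hourly traffic volumes (synthetic)
traffic = [100
           + 30 * math.sin(2 * math.pi * t / 24)    # diurnal cycle
           + 10 * math.sin(2 * math.pi * t / 12)    # 12-hour harmonic
           + 15 * math.sin(2 * math.pi * t / 168)   # weekly cycle
           + rng.gauss(0, 5)                        # measurement noise
           for t in range(hours)]

spectrum = power_spectrum(traffic)
# The three strongest frequency indices, converted back to periods in hours.
strongest = sorted(spectrum, key=spectrum.get, reverse=True)[:3]
periods_in_hours = [hours / k for k in strongest]
print(periods_in_hours)
```

The same procedure applied to packet counts sampled at millisecond resolution would
expose the software- and transmission-driven periodicities instead, limited only by the
sampling rate and the length of the trace.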

The autonomous, non-user-driven periodicities operate mostly at timescales many
orders of magnitude smaller than user behavior. At the lowest period, and
correspondingly highest frequency, are the periodicities due to the throughput of
packet transmission at the link level. This has been termed the "fundamental
frequency" (79) of a link and can be deduced from the equation

$$f = \frac{T}{s} \qquad (37)$$

where $T$ is the average throughput of the link and $s$ is the average packet size at
the link level. A quick inspection reveals this equation is identical to that for the
flow rate given by equation [T]. Indeed, the fundamental frequency is the rate of
packet emission across the link and is the highest frequency periodicity possible. The
theoretical maximum fundamental frequency is given by

$$f_{\max} = \frac{B}{\mathrm{MTU}} \qquad (38)$$

where $B$ is the bandwidth of the link and the packet size is the MTU packet size.
Therefore, for 1 Gbps, 100 Mbps, and 10 Mbps Ethernet links with MTU sizes of 1500
bytes, the theoretical maximum fundamental frequencies are 83.3 kHz, 8.3 kHz, and
833 Hz respectively.
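
The three figures above follow directly from dividing the link bandwidth by the packet
size in bits, as this short Python check shows. The function name
`fundamental_frequency` is introduced here for illustration, and the calculation
deliberately ignores Ethernet framing overhead such as the preamble and inter-frame
gap, which would lower the achievable rate slightly.

```python
def fundamental_frequency(throughput_bps, packet_size_bytes):
    """Packet emission rate in Hz: throughput divided by packet size in bits."""
    return throughput_bps / (packet_size_bytes * 8)


MTU_BYTES = 1500
links = [("1 Gbps", 1e9), ("100 Mbps", 1e8), ("10 Mbps", 1e7)]
for name, bandwidth_bps in links:
    f_max = fundamental_frequency(bandwidth_bps, MTU_BYTES)
    print(f"{name:>8}: {f_max / 1000:.2f} kHz")
```

Passing a measured average throughput and average packet size instead of the bandwidth
and MTU gives the observed fundamental frequency rather than its theoretical maximum.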

The fundamental frequencies usually measured via power spectrum diagrams are lower
than the theoretical maxima due to lower throughput. The fundamental frequency also
generally displays harmonics (79).

Broido et al. (21) believe there are thousands of periodic processes in the Internet.
Among other prominent recognized periodicities are BGP router table updates sent every
30 seconds, SONET frames transmitted every 125 μs, DNS updates transmitted with periods
of 75 minutes, 1 hour, and 24 hours due to default settings in Windows 2000 and XP DNS
software Q, and, in TCP flows, ACK packets at a frequency of 1/RTT (21; 19; \M), with
RTTs usually ranging from 10 ms to 1 s.

The main practical applications being researched for network spectroscopy are
inferring network path characteristics such as bandwidth, digital fingerprinting of
link transmissions, and detecting malicious attack traffic by changes in the frequency
domain of the transmission signal. The works (88; 80) analyze the distribution of
packet interarrival times to infer congestion and bottlenecks on network paths
upstream. In (29; 27; 84; 83; 106), various measures of packet arrival distributions,
particularly in the frequency domain, are being tested to recognize and analyze
distributed denial of service or other malicious attacks against computer networks.
Inspecting the frequency domain of a signal can also reveal the fingerprints of the
various link-level technologies used along the route of the signal, as is done in
()21t |3



10.4. Biological/ecological models and Internet traffic

Comparisons of the Internet to biological or ecological systems are legion and range
from the theoretically precise to philosophical speculations, both in popular fiction
such as William Gibson's Neuromancer and Masamune Shirow's Ghost in the Shell (72; 151)
and in the opinions of some researchers, such as Vernor Vinge's "Singularity" dlT:

Fig. 5. A rough breakdown of the major periodicities in Internet traffic showing the responsible
protocols and their period in seconds. The periodicities span over 12 orders of magnitude and
different protocol layers tend to operate on different time scales.



The focus here is on scientific papers which have used mathematical models,
biological or ecological, to describe functions of the Internet or compare certain
functions to physical systems. The growth of the Internet's nodes in terms of a
birth/death process is covered in (139). In (69), Fukuda, Nunes-Amaral, and Stanley use
several statistical analyses to show a striking similarity between variations in daily
Internet active connections in a data trace and statistics on heartbeat intervals. By
separating both non-stationary time series into stationary segments and using
detrended fluctuation analysis (DFA), they show that the magnitudes of activity for
night-time (non-congested) Internet connections and healthy heartbeats are
statistically very similar. Likewise, day-time (congested) Internet connections and
diseased heartbeat intervals are also similar in their fluctuations. They propose that
a general nonlinear systems explanation underlies both systems and that, given that the
heart rate is controlled by the autonomic nervous system, an understanding of Internet
functions and properties could be used to study the autonomic nervous system as well.

Name          Data

CAIDA         Probably the largest and most comprehensive repository of all types of
              Internet data and research. Hosted by UC San Diego
NLANR         Older traffic trace project; now mostly housed at CAIDA
WIDE MAWI     Japan's WIDE Project traffic trace archive; data source of many graphs
              in this paper
PingER        Stanford project to monitor Ping response in IPv4 and IPv6 across the
              Internet
RouteViews    U of Oregon's database on Internet routing tables and BGP data
tcpdump       Main program used to collect traffic for analysis; used in many packet
              sniffing programs
ns-2          Commonly used network traffic simulator in the network engineering
              community

Table 7. Common Internet traffic data sources & software.

Several authors have also used ecological interaction models, such as mathematical
models of competition and mutualism, to study the interaction between web sites and
search engines. In (112; 95; 96), competition and cooperation between web sites are
analyzed using the n-competitor Lotka-Volterra differential equations from ecology.
The steady state of "winner takes all" or multiple participants is extracted from
stability criteria and compared to actual market competition. In (174), another model
which includes a cooperation effect is introduced to study the same dynamics. The
interesting analogy between search engines and websites as a mutualistic relationship
is introduced in (175). The postulated mutualism is obligate for the search engine and
facultative for the websites. This is similar to the sea anemone and hermit crab or
mycorrhiza and plant mutualisms in nature. They show that strong mutual support for web
sites by search engines and vice versa offers the best opportunity for long-term
sustainability and growth.
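
The two steady states mentioned for the competition models, coexistence and "winner
takes all", can be reproduced with a few lines of Python. The sketch integrates the
textbook competitive Lotka-Volterra equations with a forward Euler step. The specific
equation form, the function name `compete`, and every parameter value are assumptions
made for illustration and are not taken from the cited web site studies.

```python
def compete(shares, growth, capacity, overlap, dt=0.01, steps=20000):
    """Forward Euler integration of n-competitor Lotka-Volterra dynamics.

    Each site i follows
        dx_i/dt = growth[i] * x_i * (1 - sum_j overlap[i][j] * x_j / capacity[i])
    where overlap[i][j] is the competitive pressure of site j on site i and
    overlap[i][i] is taken to be 1.
    """
    x = list(shares)
    n = len(x)
    for _ in range(steps):
        pressure = [sum(overlap[i][j] * x[j] for j in range(n)) for i in range(n)]
        x = [max(0.0, x[i] + dt * growth[i] * x[i] * (1.0 - pressure[i] / capacity[i]))
             for i in range(n)]
    return x


start = [0.6, 0.4]  # hypothetical initial audience shares of two web sites
growth = [1.0, 1.0]
capacity = [1.0, 1.0]

# Weak competition: each site mostly serves its own audience, so both survive.
coexist = compete(start, growth, capacity, overlap=[[1.0, 0.5], [0.5, 1.0]])

# Strong competition: the sites fight for the same audience, so one takes all.
winner = compete(start, growth, capacity, overlap=[[1.0, 1.5], [1.5, 1.0]])

print([round(v, 3) for v in coexist], [round(v, 3) for v in winner])
```

With identical sites, cross-competition weaker than self-limitation leads to a shared
market, while stronger cross-competition makes the outcome depend on which site starts
ahead.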

The last paper covered in this section is a recent publication which draws
similarities between the energy use and scaling of information networks and metabolic
scaling phenomena such as Kleiber's Law, the 3/4 power-law scaling of metabolism with
organism mass (122). Though the bulk of the paper compares the circuitry density and
area of electronic circuits in microprocessors, they derive, with a limited set of data
points, a scaling relationship between the total processing power of hosts on the
Internet and Internet backbone bandwidth with a scaling exponent of about 2/3. Future
research in this direction, especially if a valid scaling law relating topology and
dynamics is discovered, will surely be very fruitful.

11. Conclusion 

With every passing year, research is making us more and more aware of the complex
dynamics and interplay of factors on the Internet. Though many may haphazardly use the
terms self-organization, emergence, or power law, this review has hopefully laid out
the concrete facts about what is known clearly about Internet traffic, what is less
clear, and where many new paths can be beaten. Unlike most systems, which are amenable
to constant analysis over long time periods, the Internet is ever changing. What we
understand today may not completely apply several years from now. In addition, our
knowledge of long-range correlations and dynamics among multiple sites and links is
still in its infancy. Congestion is the only emergent property which has been studied
in any detail, and it remains to be seen if it is the only one that exists. There is
much room for speculation on these matters without being irresponsibly fanciful.

At the core, these issues are more than academic, since the long-term efficiency and
stability of the Internet will require us to understand it and its traffic well enough
to optimize it for the ends of users. Advances in understanding the Internet are also
enabled and constrained by our knowledge of nonlinear dynamics and complex systems in
general. As more themes and discoveries about these systems emerge, they will doubtless
provide us with more tools with which to investigate the Internet and uncover more of
the story behind its dynamics. Finally, as mentioned earlier, it is essential for more
cross-disciplinary cooperation to take place in order to accelerate our understanding
of Internet phenomena. The two groups have cooperated in some areas and are hardly
irreconcilable. Combining both toolkits can definitely bring forth more surprising and
rewarding results.

Finally, though this paper has been heavy on esoteric technical aspects of the
Internet, we must not lose sight of the whole, as the poet Walt Whitman once
wonderfully wrote (jlTSh):

When I heard the learn'd astronomer;
When the proofs, the figures, were ranged in columns before me;
When I was shown the charts and the diagrams, to add, divide,
    and measure them;
When I, sitting, heard the astronomer, where he lectured with much
    applause in the lecture-room,
How soon, unaccountable, I became tired and sick;
Till rising and gliding out, I wander'd off by myself,
In the mystical moist night-air, and from time to time,
Look'd up in perfect silence at the stars.

For everything said about self-similarity, phase transitions, and related matters,
we must never lose sight of the Internet as the wonderful invention it has been in its
cultural, economic, and technological aspects, uniting those from around the world.
Even if only in a small part, this should animate and encourage our research into the
future.